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■ We study generalized bootstrap confidence regions for the mean 

of a random vector whose coordinates have an unknown dependency 
structure. The random vector is supposed to be either Gaussian or 
to have a symmetric and bounded distribution. The dimensionality 
of the vector can possibly be much larger than the number of obser- 
vations and we focus on a nonasymptotic control of the confidence 
level, following ideas inspired by recent results in learning theory. We 
consider two approaches, the first based on a concentration principle 
' (valid for a large class of resampling weights) and the second on a 

resampled quantile, specifically using Rademacher weights. Several 
intermediate results established in the approach based on concentra- 
tion principles are of interest in their own right. We also discuss the 
question of accuracy when using Monte Carlo approximations of the 
fT^ ■ resampled quantities. 

> 

in . 

t^. , 1. Introduction. 

. 1.1. Goals and motivations. Let Y := (Y 1 , . . . ,Y n ) be a sample of n > 2 

i.i.d. observations of an integrable random vector in M. K , with dimensionality 
K possibly much larger than n and unknown dependency structure of the 
coordinates. Let /j, 6 WL K denote the common mean of the Y\ Our goal is 
to find a nonasymptotic (1 — a)-confidence region G(Y, 1 — a) for fj,, of the 
form 



(1) G(Y, l-a) = {xe R K \<P(Y -x)< t a (Y)}, 
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where (j> : R — > R is a function which is fixed in advance (measuring a kind 
of distance, e.g., an £ p -norm for p G [l,oo]), a E (0, 1), t a : (M. K ) n — > R is a 
possibly data-dependent threshold and Y = ^ X^"=i ^ ^ ^ K ^ s the empirical 
mean of the sample Y. 

The point of view developed in the present work focuses on the following 
goal: 

• obtaining nonasymptotic results, valid for any fixed K and n, with K 
possibly much larger than the number of observations n, while 

• avoiding any specific assumption on the dependency structure of the co- 
ordinates of Y* (although we will consider some general assumptions over 
the distribution of Y, namely symmetry and boundedness or Gaussian- 

ity). 

In the Gaussian case, a traditional parametric method based on the direct 
estimation of the covariance matrix to derive a confidence region would not 
be appropriate in the situation where K 3> n, unless the covariance matrix 
is assumed to belong to some parametric model of lower dimension, which 
we explicitly do not want to posit here. In this sense, the approach followed 
here is closer in spirit to nonparametric or semiparametric statistics. 

This point of view is motivated by some practical applications, especially 
neuroimaging [8, 18, 26]. In a magnetoencephalography (MEG) experiment, 
each observation Y* is a two- or three-dimensional brain activity map, ob- 
tained as a difference between brain activities in the presence or absence of 
some stimulation. The activity map is typically composed of about 15,000 
points; the data can also be a time series of length between 50 and 1000 
such maps. The dimensionality K can thus range from 10 4 to 10 . Such ob- 
servations are repeated from n = 15 up to 4000 times, but this upper bound 
is seldom attained [32]; in typical cases, one has n < 100 <C K. In such 
data, there are strong dependencies between locations (the 15,000 points 
are obtained by pre-processing data from 150 sensors) and these dependen- 
cies are spatially highly nonhomogeneous, as noted by [26]. Moreover, there 
may be long-distance correlations, for example, depending on neural connec- 
tions inside the brain, so that a simple parametric model of the dependency 
structure is generally not adequate. Another motivating example is given by 
microarray data [14], where it is common to observe samples of limited size 
(say, less than 100) of a vector in high dimension (say, more than 20,000, 
each dimension corresponding to a specific gene) and where the dependency 
structure may be quite arbitrary. 

1.2. Two approaches to our goal. The ideal threshold t a in (1) is obvi- 
ously the (1 — a) quantile of the distribution of 0(Y — /x). However, this 
quantity depends on the unknown dependency structure of the coordinates 
of Y l and is therefore itself unknown. 
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The approach studied in this work is to use a (generalized) resampling 
scheme in order to estimate t a . The heuristics of the resampling method 
(introduced in [11], generalized to exchangeable weighted bootstrap by [23, 
28] ) is that the distribution of the unobservable variable Y — [i is "mimicked" 
by the distribution, conditionally on Y, of the resampled empirical mean of 
the centered data. This last quantity is an observable variable and we denote 
it as follows: 

(2) y( w ~ w ) ■= Ly \Wi - W)Y i = -Y 'Wi(X -Y) = (Y -Y)< w >, 

i=l i=l 

where (Wi)i<j< n are real random variables independent of Y, called the 
resampling weights, and W = n~ 1 ^2™ =1 Wi. We emphasize that the weight 
family (Wi)i<j< n itself need not be independent. 

In Section 2.4, we define in more detail several specific resampling weights 
inspired both by traditional resampling methods [23, 28] and by recent sta- 
tistical learning theory. Let us give two typical examples reflecting these two 
sources: 

• Efron's bootstrap weights. W is a multinomial random vector with param- 
eters (n;re _1 , . . . This is the standard bootstrap. 

• Rademacher weights. Wi are i.i.d. Rademacher variables, that is, Wi 6 
{ — 1,1} with equal probabilities. They are closely related to symmetriza- 
tion techniques in learning theory. 

It is useful to observe at this point that, to the extent that we only consider 
resampled data after empirical centering, shifting all weights by the same 
(but possibly random) offset C > does not change the resampled quantity 
introduced in (2). Hence, to reconcile the intuition of traditional resampling 
with what could possibly appear as unfamiliar weights, one could always 
assume that the weights are translated to enforce (for example) weight pos- 
itivity or the condition n~ l Yli=i Wi = 1 (although, of course, in general, 
both conditions cannot be ensured at the same time simply by transla- 
tion). For example, Rademacher weights can be interpreted as a resampling 
scheme where each Y* is independently discarded or "doubled" with equal 
probability. 

Following the general resampling idea, we investigate two distinct ap- 
proaches in order to obtain nonasymptotic confidence regions: 

• Approach 1 ("concentration approach," developed in Section 2). 

The expectations of 0(Y — fj) and (f)(Y^ w ~ w ^) can be precisely com- 
pared and the processes 0(Y — /x) and Ew^Y^ - ^)] concentrate well 
around their respective expectations, where Ew denotes the expectation 
operator with respect to the distribution of W (i.e., conditionally on Y). 
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• Approach 2 ("direct quantile approach," developed in Section 3). 

The 1 — a quantile of the distribution of (f>(Y^ w ~ w ^) conditionally on 
Y is close to the 1 — a quantile of 0(Y — fi). 

Regarding the second approach, we will restrict ourselves specifically to 
Rademacher weights in our analysis and rely heavily on a symmetrization 
principle. 

1.3. Relation to previous work. Using resampling to construct confidence 
regions is a vast field of study in statistics (see, e.g., [4, 9, 11, 15, 16, 27]). 
Available results are, however, mostly asymptotic, based on the celebrated 
fact that the bootstrap process is asymptotically close to the original em- 
pirical process [31]. Because we focus on a nonasymptotic viewpoint, this 
asymptotic approach is not adapted to the goals we have fixed. Note, also, 
that the nonasymptotic viewpoint can be used as a basis for an asymptotic 
analysis in the situation where the dimension K grows with n, a setting 
which is typically not covered by standard asymptotics. 

The "concentration approach" mentioned in the previous subsection is 
inspired by recent results coming from learning theory and relates, in par- 
ticular, to the notion of Rademacher complexity [20]. This notion has been 
extended in the recent work of Fromont [13] to more general resampling 
schemes and this latter work has had a strong influence on the present pa- 
per. 

On the other hand, what we called the "quantile approach" in the previ- 
ous subsection is strongly related to exact randomization tests (which are 
based on an invariance of the null distribution under a given transformation; 
the underlying idea can be traced back to Fisher's permutation test [12]). 
Namely, we will only consider symmetric distributions; this is a specific in- 
stance of an invariance with respect to a transformation and will allow us to 
make use of distribution-preserving randomization via sign reversal. Here, 
the main difference with traditional exact randomization tests is that, since 
our goal is to derive a confidence region, the vector of the means is unknown 
and, therefore, so is the exact invariant transformation. Our contribution to 
this point is essentially to show that the true vector of the means can be 
replaced by the empirical one in the randomization, at the cost of additional 
terms of smaller order in the threshold thus obtained. To our knowledge, this 
gives the first nonasymptotic approximation result on resampled quantiles 
with an unknown distribution mean. 

Finally, we contrast the setting studied here with a strand of research 
studying adaptive confidence regions (in a majority of cases, ^2-balls) in 
nonparametric Gaussian regression. A seminal paper on this topic is [22] 
and recent work includes [17, 21, 29] (from an asymptotic point of view) 
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and [3, 5, 6, 19] (which present nonasymptotic results). Related to this set- 
ting and ours is [10], where adaptive tests for zero mean are developed for 
symmetric distributions, using randomization by sign reversal. The setting 
considered in these papers is that of regression on a fixed design in high di- 
mension (or in the Gaussian sequence model) with one observation per point 
and i.i.d. noise. This corresponds (in our notation) to n = 1, while the K co- 
ordinates are assumed independent. Despite some similarities, the problem 
considered here is of a different nature: in the aforementioned works, the fo- 
cus is on adaptivity with respect to some properties of the true mean vector, 
concretized by a family of models (e.g., linear subspaces or Besov balls in 
the Gaussian sequence setting); usually, an adaptive estimator performing 
implicit or explicit model selection relative to this collection is studied and 
a crucial question for obtaining confidence regions is that of empirically es- 
timating the bias of this estimator when the noise dependence structure is 
known. In the present paper, we do not consider the problem of model selec- 
tion, but the focus is on evaluating the estimation error under an unknown 
noise dependence structure (for the "naive" unbiased estimator given by the 
empirical mean). 

1.4. Notation. We first introduce some notation that will be useful through- 
out the paper. 

• A boldface letter indicates a matrix. This will almost exclusively concern 
the K x n data matrix Y. A superscript index such as Y* indicates the 
ith. column of a matrix. 

• If /j, € R , Y — jj, is the matrix obtained by subtracting [i from each 
(column) vector of Y. Similarly, for a vector W £ W 1 and c£K, we denote 
W-c:=(Wi -c)i<i<„ et 11 . 

• If X is a random variable, then T>(X) is its distribution and Var(A) 
is its variance. We use the notation X ~ Y to indicate that X and Y 
have the same distribution. Moreover, the support of D(X) is denoted by 
suppP(A). 

• We denote by Ew[-] the expectation operator over the distribution of the 
weight vector W only, that is, conditional on Y. We use a similar notation, 
Pjy, f° r the corresponding probability operator and Ey,IPy for the same 
operations conditional on W. Since Y and W are always assumed to be 
independent, the operators Ejy and Ey commute by Fubini's theorem. 

• The vector a = {o~k)i<k<K is the vector of the standard deviations of the 
data: Vfc, l<k<K, a k ~= Var 1 / 2 ^). 

• $ is the standard Gaussian upper tail function: if X ~ A/"(0, 1), Vx 6 R, 
$(x) :=F(X > x). 

• We define the mean of the weight vector W := - Y17=i ^> empirical 
mean vector Y := — Y17=i ^* an( ^ the- resampled empirical mean vector 
y(w) := Iy« W y\ 
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• We use the operator | • | to denote the cardinality of a set. 

• For two positive sequences (u n ) n and (v n ) n , we write u n = Q(v n ) when 
(u n v~ 1 ) n stays bounded away from zero and +00. 

Several properties may be assumed for the function </> : M. K — > R used to 
define confidence regions of the form (1): 

• Subadditivity: Vx,x' G R K ,(j)(x + x') < 4>{x) +4>{x'). 

• Positive homogeneity: Vx S M. K , VA £ M + , (j)(Xx) = \cj)(x). 

• Boundedness by the ^ p -norm, p € [l,oo]: Vx 6 M. , \cf>(x)\ < \\x\\ p , where 
||x|| p is equal to (X^jfcLi \ x k\ p ) 1 ^ p if p < 00 and maxfc{|xfc|} for p = +00. 
Note, also, that all of the results in this paper are still valid with any 
normalization of the f p -norm [in particular, it can be taken to be equal to 

(K^ 1 Hk=i\ x k\ p ) 1/p so that the ^p-norm of a vector with equal coordinates 
does not depend on the dimensionality K\. 

Finally, we introduce the following possible assumptions on the generating 
distribution of Y: 

(GA) The Gaussian assumption: the are Gaussian vectors. 
(SA) The symmetric assumption: the Y* are symmetric with respect to fi, 

that is, (Y i -/i)~( y u- Y*). 
(BA) (p,M) the boundedness assumption: ||Y l — fi\\ p < M a.s. 

In this paper, we primarily focus on the Gaussian framework (GA), where 
the corresponding results will be more accurate. In the sequel, when consid- 
ering (GA) and the assumption that (j) is bounded by the £ p -norm for some 
p > 1, we will additionally always assume that we know some upper bound 
on the ^p-norm of a. The question of finding an upper bound for ||o"|| p based 
on the data is discussed in Section 4.1. 

2. Confidence region using concentration. 

2.1. Main result. We consider here a general resampling weight vector 
W, that is, an M n -valued random vector W = (W / j)i<j< n independent of Y 
and satisfying the following properties: for all i G {1, . . . , n}, K[VFj 2 ] < 00 and 

n- 1 J27=i^\w i -W\>o. 

In this section, we will mainly consider an exchangeable resampling weight 
vector, that is, a resampling weight vector W such that (Wi)i<i< n has an 
exchangeable distribution (in other words, it is invariant under any permu- 
tation of the indices). Several examples of exchangeable resampling weight 
vectors are given in Section 2.4, where we also address the question of how 
to choose between different possible distributions of W. An extension of our 
results to nonexchangeable weight vectors is proposed in Section 2.5.1. 
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Table 1 

Resampling constants for some classical resampling weight vectors 



Efron 

Efron, n — > +00 

Rademacher 
Rademacher, n — > +00 

rho(g) 

rho(n/2) 
Leave-one-out 
Regular V-fcv 



2(1 - i)» = A w <B W < ^2=1, CV = 1 

— A \y < ^ 1 — Cw 



±=Aw<B W < yjl 

Aw Cw TDv^ = 1 



3-, C w = KD W < 1+ 4= 



A W = 2(1-^),B W = ^2_1 



7-1-1 

Aw = -Bw 
^ = Aw < Bw 



q -l, D w = % 



1 - 



2q 



1, CV 



A w = £ < B w = -^-j, CV = v^(V - I)" 1 , £V = 1 



Four constants that depend only on the distribution of W appear in the 
results below (the fourth one is defined only for a particular class of weights) . 
They are defined as follows and computed for classical resamplings in Table 
1: 



(3) 
(4) 

(5) 
(6) 



A w 
Cw 



K\Wi-W\; 



E 



lJ2(w t -wf 



i=i 



n 



n — 1 
= a + E\W-x 

if, for all i,\Wi 



E[(Wi - Wf 



1/2 



1/2 



a a.s. (with a > 0, xq £ 



Note that these quantities are positive for an exchangeable resampling weight 
vector W and satisfy 



< A w < B w < CwV 1 ~ I/"- 

Moreover, if the weights are i.i.d., we have Cw = Var(Wi) 1 / 2 . We can now 
state the main result of this section. 



Theorem 2.1. Fix a <E (0,1) and p£ [l,oo]. Let 0:R K ->■ M 6e any 
function which is subadditive, positive homogeneous and bounded by the £ p - 
norm, and let W be an exchangeable resampling weight vector. 
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1. If*Y satisfies (GA), then 



(7) 



(Y-/x)< 



■^ + IH| P $ \a/2) 



Bw 



nBy/ yfn 



holds with probability at least 1 — a. The same bound holds for the lower 
deviations, that is, with inequality (7) reversed and the additive term 
replaced by its opposite. 
2. IfY satisfies (SA) and (BA) (p,M) for some M > 0, then 



(8) 



(Y-//)< 



E 



in 



4 ^ + _yiog 1/q 



holds with probability at least 1 — a. If, moreover, the weight vector sat- 
isfies the assumption of (6), then 



(9) 



(y-m)> 



E 



in 



(Y<w-W0)] M 



42 



— A/l + ^V21og(l/a) 



/io/ds wt/j probability at least 1 — a. 



Inequalities (7) and (8) give regions of the form (1) that are confidence 
regions of level at least 1 — a. They require knowledge of some upper bound 
on \\cr\\p (resp., M) or a good estimate of it. We address this question in 
Section 4.1. 

In order to obtain some insight into these bounds, it is useful to com- 
pare them with an elementary inequality. In the Gaussian case, it is true for 
each coordinate k,l < k < K, that the following inequality holds with prob- 
ability 1 — a: | Yfc — /Ltfe | < (a/2). By applying a simple union bound 
over the coordinates and using the fact that <j> is positive homogeneous and 
bounded by the £ p -norm, we conclude that the following inequality holds 
with probability at least 1 — a: 

(10) 0(Y- M ) < =:t B onf(«), 

which is a minor variation on the well-known Bonferroni bound. By compar- 
ison, the main term in the remainder part of (7) takes a similar form, but 
with K replaced by 1: the remainder term is dimension-independent. Natu- 
rally, the "dimension complexity" has not disappeared, but will be taken into 
account in the main resampled term instead. When K is large, the bound 
(7) can improve on the Bonferroni threshold if there are strong dependen- 
cies between the coordinates, resulting in a significantly smaller resampling 
term. 

By way of illustration, consider an extreme example where all pairwise 
coordinate correlations are exactly 1, that is, the random vector Y is made 
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of K copies of the same random variable so that there is, in fact, no dimen- 
sion complexity. Take <j)(X) = supj X{ (corresponding to a uniform one-sided 
confidence bound for the mean components). Then the resampled quantity 
in (7) is equal to zero and the obtained bound is close to optimal (up to 
the two following points: the level is divided by a factor of 2 and there is 
an additional term of order -). By comparison, the Bonferroni bound di- 
vides the level by a factor of K, resulting in a significantly worse threshold. 
In passing, note that this example illustrates that the order n -1 / 2 of the 
remainder term cannot be improved. 

If we now interpret the bound (7) from an asymptotic point of view [with 
K(n) depending on n and ||c|L = 0(1)], then the rate of convergence to zero 
cannot be faster than n -1 / 2 (which corresponds to the standard parametric 
rate when K is fixed), but it can be potentially slower, for example, if K 
increases exponentially with n. In the latter case, the rate of convergence 
of the Bonferroni threshold is always strictly slower than n -1 / 2 . In general, 
as far as the order in n is concerned, the resampled threshold converges at 
least as fast as Bonferroni's, but whether it is strictly faster depends once 
again on the coordinate dependency structure. 

However, if the coordinates are only "weakly dependent," then the thresh- 
old (7) can be more conservative than Bonferroni's by a multiplicative factor, 
while the Bonferroni threshold can sometimes be essentially optimal (e.g., 
with 4> = || • llocn ah of the coordinates independent and with small a). This 
motivates the next result, where we assume, more generally, that an alter- 
nate analysis of the problem can lead to deriving a deterministic threshold 
t a such that F(cf)(Y — fj) > t a ) < a. In this case, we would ideally like to 
take the "best of two approaches" and consider the minimum of t a and 
the resampling-based thresholds considered above. In the Gaussian case, 
the following proposition establishes that we can combine the concentration 
threshold corresponding to (7) with t a to obtain a threshold that is very 
close to the minimum of the two. 

Proposition 2.2. Fix a, 5 e (0,1), pe [l,oo] and take (f> and W as 
in Theorem 2.1. Suppose that Y satisfies (GA) and that i a n_<5) is a real 
number such that P(0(Y — fj,) > i Q (i_5)) < a(l — 5). Then, with probability 
at least 1 — a, 4>(Y — /x) is less than or equal to the minimum of i a (i-<5) and 



(11) 




nBw 




The important point to note in Proposition 2.2 is that, since the last term 
of (11) becomes negligible with respect to the rest when n grows large, we 
can choose 5 to be quite small [typically 5 = 0(l/n)] and obtain a threshold 
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very close to the minimum of t a and the threshold corresponding to (7). 
Therefore, this result is more subtle than just considering the minimum of 
two thresholds each taken at level 1 — §, as would be obtained by a direct 
union bound. 

The proof of Theorem 2.1 involves results which are of interest in their 
own right: the comparison between the expectations of the two processes 
Ew^Y^ - ^)] and 4>(Y — and the concentration of these processes 
around their means. These two issues are, respectively, examined in the two 
next Sections 2.2 and 2.3. In Section 2.4, we provide some elements for 
an appropriate choice of resampling weight vectors among several classical 
examples. The final Section 2.5 tackles the practical issue of computation 
time. 

2.2. Comparison in expectation. In this section, we compare E^Y^ -1 ^)] 
and E[c/>(Y — fi)] . We note that these expectations exist in the Gaussian (GA) 
and the bounded (BA) cases, provided that <p is measurable and bounded 
by an £ p -norm. Otherwise (in particular, in Propositions 2.3 and 2.4), we 
assume that these expectations exist. 

In the Gaussian case, these quantities are equal up to a factor that depends 
only on the distribution of W. 

Proposition 2.3. Let Y be a sample satisfying (GA) and let W be a 
resampling weight vector. Then, for any measurable positive homogeneous 
function <f> : ~R K — > M, we have the following equality: 

(12) B w E[<f>(Y-fj,)) =E[0(Y< w - W >)]. 

If the weights are such that Y17=i0^i — W) 2 = n, then the above equality 
holds for any function <j) (and Byy = 1). 

For some classical weights, we give bounds or exact expressions for Bw 
in Table 1. In general, we can compute the value of Byy by simulation. Note 
that in a non-Gaussian framework, the constant Byy is still of interest, in 
an asymptotic sense: Theorem 3.6.13 in [31] uses the limit of By/ when n 
goes to infinity as a normalizing constant. 

When the sample is only assumed to have a symmetric distribution, we 
obtain the following inequalities. 

Proposition 2.4. Let Y be a sample satisfying (SA), W an exchange- 
able resampling weight vector and </> : W K — > 1R any subadditive, positive ho- 
mogeneous function. 

(i) We have the following general lower bound: 



(13) 



^E[«KY-/,)]<E[<MY<">)]. 
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(ii) If the weight vector satisfies the assumption of (6), then we have the 
following upper bound: 

(14) D W ^{Y - n)] > E[(/»(Y< H/ -^ )] . 

The bounds (13) and (14) are tight (i.e., A\y/D]y — > 1 as n — > oo) for some 
classical weights; see Table 1. When Y is not assumed to have a symmetric 
distribution and W = 1 a.s., Proposition 2 of [13] shows that (13) holds with 
Aw replaced by E(Wi — W)+. Therefore, assumption (SA) allows us to get a 
tighter result (e.g., twice as sharp with Efron or Rademacher weights). It can 
be shown (see [1], Chapter 9) that this factor of 2 is unavoidable in general 
for a fixed n when (SA) is not satisfied, although it is unnecessary when n 
goes to infinity. We conjecture that an inequality close to (13) holds under 
an assumption less restrictive than (SA) (e.g., concerning an appropriate 
measure of skewness of the distribution of Y 1 ). 

2.3. Concentration around the expectation. In this section, we present 
concentration results for the two processes (p(Y — y) and Em/[0(Y^ m/_w/ ^)]. 

Proposition 2.5. Let p G [l,oo], Y be a sample satisfying (GA) and 
4>:M. K — > R be any subadditive function, bounded by the £ p -norm. 

(i) For all a G (0, 1), with probability at least 1 — a, we have 

(15) (K T- / z)<E[0(Y-^)]+ W 



n 



and the same bound holds for the corresponding lower deviations. 

(ii) Let W be an exchangeable resampling weight vector. Then, for all 
a G (0, 1), with probability at least 1 — a, we have 



(16) 



E w [<j><?W-w))] < E^Y^"^)] + WVj 

n 



and the same bound holds for the corresponding lower deviations. 

The bound (15) with a remainder in re" 1 / 2 is classical; this order in n 
cannot be improved, as seen, for example, by taking K = 1 and (j) to be the 
identity function. The bound (16) is more interesting because it illustrates 
one of the key properties of resampling, the "stabilization effect": the re- 
sampled expectation concentrates much faster to its expectation than the 
original quantity. This effect is known and has been studied asymptotically 
(in fixed dimension) using Edgeworth expansions (see [15]); here, we demon- 
strate its validity nonasympotically in a specific case (see also Section 4.2 
below for additional discussion). 
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In the bounded case, the next proposition is a minor variation of a result 
by Fromont. It is a consequence of McDiarmid's inequality [25]; we refer the 
reader to [13] (Proposition 1) for a proof. 

Proposition 2.6. Letpe [l,oo], M > 0, Y be a sample satisfying (BA) 
(p, M ) and <ft : M. K — > R be any subadditive function bounded by the £ p -norm. 

(i) For all a £ (0, 1), with probability at least 1 — a, we have 

(17) 4>(Y - M ) < E[0(Y - M )] + -^VlogU/a) 

and i/te same bound holds for the corresponding lower deviations. 

(ii) Lei 6e an exchangeable resampling weight vector. Then, for all 
a £ (0, 1), with probability at least 1 — a, we have 

(18) E^(Y<^>)] < EfeCV^-*))] + ^ybiaM 

v n 

and the same bound holds for the corresponding lower deviations. 

2.4. Resampling weight vectors. In this section, we consider the question 
of choosing an appropriate exchangeable resampling weight vector W when 
using Theorem 2.1 or Corollary 2.2. We define the following resampling 
weight vectors: 

1. Rademacher. Wi i.i.d. Rademacher variables, that is, Wi £ {—1,1} with 
equal probabilities. 

2. Efron (Efron's bootstrap weights). W has a multinomial distribution with 
parameters (n; n~ l , . . . , n^ 1 ). 

3. Random hold-out(q) [rho(g) for short], q £ {1, . . . ,n}. Wi = -lie/, where 
/ is uniformly distributed on subsets of {1, . . . ,n} of cardinality q. These 
weights may also be called cross-validation weights or leave-{n — q)-out 
weights. A classical choice is q = n/2 (assuming n is even) . When q = 
n — 1, these weights are called leave- one- out weights. Note that this re- 
sampling scheme is a particular case of subsampling. 

As noted in the Introduction, the first example is common in learning 
theory, while the second is classical in the framework of the resampling lit- 
erature [23, 28]. Random hold-out weights have the particular quality of 
being related to both: they are nonnegative, satisfy Y2i Wi = n a.s. and orig- 
inate with a data-splitting idea (choosing / amounts to choose a subsample) 
upon which the cross-validation idea has been built. This analogy motivates 
the "V-fold cross-validation weights" (defined in Section 2.5), in order to 
reduce the computational complexity of the procedures proposed here. 

For these classical weights, exact or approximate values for the quantities 
Ay/, By/, Cy/ and Dy/ [defined by (3) to (6)] can easily be derived (see 
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Table 1). Proofs are given in Section 5.3, where several other weights are 
considered. Now, to use Theorem 2.1 or Corollary 2.2, we have to choose a 
particular resampling weight vector. In the Gaussian case, we propose the 
following accuracy and complexity criteria: 

• First, relation (7) suggests that the quantity CyjB~^ can be proposed 
as an accuracy index for W. Namely, this index enters directly into the 
deviation term of the upper bound (while we know from Proposition 2.3 
that the expectation term is exact) so that the smaller this index is, the 
sharper the bound. 

• Second, an upper bound on the computational burden of exactly comput- 
ing the resampling quantity is given by the cardinality of the support of 
T>(W), thus providing a complexity index. 

These two criteria are estimated in Table 2 for classical weights. For any 
exchangeable weight vector W, we have CwB^ > [n/(n — l)] 1 ^ 2 and the 
cardinality of the support of T>(W) is larger than n. Therefore, the leave- 
one-out weights satisfy the best accuracy/complexity trade-off among ex- 
changeable weights. 

2.5. Practical computation of the thresholds. In practice, the exact com- 
putation of the resampling quantity E^i/[0(Y^ M/_M/ ^)] can still be too com- 
plex for the weights defined above. In this section, we consider two possi- 
ble ways to address this issue. First, it is possible to use nonexchangeable 
weights with a lower complexity index and for which the exact computation 
is tractable. Alternatively, we propose to use a Monte Carlo approximation, 
as is often done in practice to compute resampled quantities. In both cases, 
the thresholds have to be made slightly larger in order to keep a rigourous 
nonasymptotic control on the level. This is detailed in the two paragraphs 
below. 



Table 2 

Choice of the resampling weight vectors: accuracy /complexity trade-off 



Resampling 


CwB^ (accuracy) 


\swpp'D{W)\ (complexity) 


Efron 


< 1 Cl 1 )~ n ► e 


( 2 ;_- 1 ) = ©(n- 1 / 2 4") 


Rademacher 


<n/(n- 1) >1 

n — >oo 


2" 


rho(n/2) 


- \J n-l ?woo * 1 


(;; 2 )=e(n- i/2 2 n ) 


Leave-one-out 


_ \J n-l n ^ oc > 1 


n 


Regular V-icv 




V 
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2.5.1. V -fold, cross-validation weights. In order to reduce the computa- 
tion complexity, we can use "piecewise exchangeable" weights: consider a 
regular partition (Bj)±<j<y of {l,...,n} (where V G {2, ...,n} and V di- 
vides re) and define the weights W% = yrryl^B/ with J uniformly distributed 
on {1, ... , V}. These weights are called the (regular) V -fold cross-validation 
weights (V-fcv for short). 

By applying our previous results to the process (Y J )i<j<y, where Y J := 
^■^ igB .Y* is the empirical mean of Y on block Bj, we can show that 
Theorem 2.1 can be extended to (regular) V-fold cross-validation weights 
with the following resampling constants: 



A w = —, B w = -j==, Cw = y _ p D w = l. 

Additionally, when V does not divide n and the blocks are no longer regular, 
Theorem 2.1 can also be generalized, but the constants have more complex 
expressions (see Section 10.7.5 in [1] for details). With V-fcv weights, the 
complexity index is only V, but we lose a factor [(re — 1)/(V — l)] 1 ^ 2 in 
the accuracy index. With regard to the accuracy/complexity trade-off, the 
most accurate cross-validation weights are leave-one-out (V = n), whereas 
the 2-fcv weights are the best from the computational viewpoint (but also 
the least accurate). The choice of V is thus a trade-off between these two 
terms and depends on the particular constraints of each problem. 

However, it is worth noting that as far as the bound of inequality (7) is 
concerned, it is not necessarily indispensable to aim for an accuracy index 
close to 1. Namely, this will result in a corresponding deviation term of order 
re" 1 , while there is, additionally, another unavoidable deviation term or order 
n -1 / 2 in the bound. This suggests that an accuracy index of order o(n 1//2 ) 
would actually be sufficient (as re grows large). In other words, using V-fcv 
with V "large" [e.g., V = 0(log(re))] would result in only a negligible loss 
of overall accuracy as compared to leave-one-out. Of course, this discussion 
is specific to the form of the bound (7). We cannot formally exclude the 
possibility that a different approach could lead to a different conclusion, 
unless it can be proven that the deviation terms in (7) cannot be significantly 
improved, an issue we do not address here. 

2.5.2. Monte Carlo approximation. When using a Monte Carlo approx- 
imation to evaluate Ejy [(/'(Y^ -1- ^)], we randomly draw a number B of 

i.i.d. weight vectors IV 1 ,..., W B and compute J2f=i </>(Y^ J_M ^). This 
method is quite standard in the bootstrap literature and can be improved 
in several ways (see, e.g., [15], Appendix II). 

On the one hand, the number B of draws of W should be taken small 
enough so that B times the computational cost of evaluating cj)(Y^ w ' '- WJ )) 
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is still tractable. On the other hand, the number B should be taken large 
enough to make the Monte Carlo approximation accurate. In our framework, 
this is quantified more precisely by the following proposition (for bounded 
weights) . 

Proposition 2.7. Let B>1 and W 1 ,...,W B be i.i.d. exchangeable 
resampling weight vectors such that W\ — W 1 £ [cijCfc] a.s. Let p£ [l,oo] 
and 4>:R K be any subadditive function bounded by the £ p -norm. IfY is 
a fixed sample, then, for every f3 E (0, 1), 



B 



(19) %[0^)] < I][>(Y^-^>) + (<* -^f^^WaWp 

holds with probability at least 1 — f3, where a denotes the vector of aver- 
age absolute deviations to the median, a := ((- X^ILil^fc ~~ ^k\))i<k<K [M^ 
denoting a median of (yi)i<i<n]- 

As a consequence, Proposition 2.7 suggests an explicit correction of the 
concentration thresholds taking into account B bounded weight vectors. For 
instance, with Rademacher weights, we can use (19) with c% — c\ = 2 and (5 = 
7a [7 G (0, 1)]. Then, in the thresholds built from Theorem 2.1 and Proposi- 
tion 2.2, one can replace Eyi/fc^Y^ -1- ^)] by its Monte Carlo approximation 

at the cost of changing a into (1 —7)0 and adding B^yJ^- ^^ ^ \\a\\ p 
to the threshold. 

As n grows large, this remainder term is negligible in comparison to the 
main one when B is (for instance) of order n 2 . In practical applications, 
B can be chosen as a function of Y because (19) holds conditionally on 
the observed sample. Therefore, we can use the following strategy: first, 
compute a rough estimate t cs t.a of the final threshold [e.g., if (f> = \\ ■ ||oo 
and Y is Gaussian, take the Bonferroni threshold (10)]. Then choose B ^> 

4t,aPllplog((7Q)- 1 ). 

3. Confidence region using resampled quantiles. 

3.1. Main result. In this section, we consider a different approach to con- 
structing confidence regions, directly based on the estimation of the quan- 
tile via resampling. Once again, since we aim for a nonasymptotic result for 
K S> n, the standard asymptotic approaches cannot be applied here. For this 
reason, we base the proposed results on ideas coming from exact randomized 
tests and consider here the case where Y 1 has a symmetric distribution and 
where W is an i.i.d. Rademacher weight vector, that is, weights are i.i.d. 
with F(Wi = 1) = F(Wi = -1) = 1/2. 
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The resampling idea applied here is to approximate the quantiles of the 
distribution T>(cp(Y — /i)) by the quantiles of the corresponding resampling- 
based distribution: 

P(<^(Y^- W >)|Y)=I?(^((Y-Y)W)|Y). 

For this, we take advantage of the symmetry of each Y* around its mean. 
For a function cj>, let us define the resampled empirical quantile by 

(20) Y) :=m£{xeR\F w {<f>(y< w >)>x) < a}. 

The following lemma, close in spirit to exact test results, is easily derived 
from the "symmetrization trick," that is, from taking advantage of the dis- 
tribution invariance of the data via sign reversal. 

Lemma 3.1. Let Y be a data sample satisfying assumption (SA) and 
<j> : M K — > R be a measurable function. The following then holds: 

(21) F(4>(Y-fj,)>q a ((l>,Y-fi))<a. 

Of course, since (/ Q ((j6, Y — /i) still depends on the unknown fi, we cannot 
use this threshold to get a confidence region of the form (1). It is, in principle, 
possible to build a confidence region directly from Lemma 3.1 by using the 
duality between tests and confidence regions, but this would be difficult to 
compute and not of the desired form (1). Therefore, following the general 
philosophy of resampling, we propose replacing the true mean \x by the 
empirical mean Y in the quantile q a ((f>, Y — //). The following main technical 
result of this section gives a nonasymptotic bound on the cost of performing 
this operation. 

Theorem 3.2. Fix 5,ao £ (0,1). Let Y be a data sample satisfying 
assumption (SA). Let f:(^ K ) n — > [0, oo) be a nonnegative function. Let 
cf) : W K — > R be a nonnegative, subadditive and positive homogeneous func- 
tion. Define <j)(x) := max((p(x),(j)(—x)). The following holds: 

¥(4>(Y -fi)> q ao{1 _ s) (4>, Y - Y) + 71 M/OO) 

(22) 

<a o +P(0(Y-//)>/(Y)) ) 

where 71(77) := 2g ("'^/ 2 ) n and B{n, rj) := maxjfc E {0, . . . , n}\2~ n Y2?=k Cl) — 
rf\ is the upper quantile function of a Binomial(n, ^) variable. 

In this result, the resampled quantile term q ao (\s){<t>i Y — Y) should be 
interpreted as the main term of the threshold and the rest, involving the 
function /, as a remainder term. In the usual resampling philosophy, one 
would only consider the main term at the target level, that is, ao = a and 5 = 
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0. Here, the additional remainder terms are introduced to account rigorously 
for the validity of the result in a nonasymptotic setting. These remainder 
terms have two effects: first, the resampled quantile in the main term is 
computed at a "shrunk" error level «o(l — S) <a and, secondly, there is an 
additional additive term in the threshold itself. 

The role of the parameters 5, ao and / is to strike a balance between 
these effects. Generally speaking, / should be an available upper bound on 
a quantile of c/>(Y — /x) at a level a\ <C «o- On the left-hand side, / appears 
in the threshold with the factor 71, which can be more explicitly bounded 



using Hoeffding's inequality. The above result therefore transforms a possibly 
coarse "a priori" bound / on quantiles into a more accurate quantile bound 
based on a main term estimated by resampling and a remainder term based 
on / multiplied by a small factor. 

In order to get a clearer insight, let us consider an example of specific 
choices for the parameters 5, ao and / in the Gaussian case. First, choose 
8 = 0(n~ 7 ) and — = 1 — 0(?x -7 ) for some 7 > 0, say 7 = 1. This way, the 
main term is the resampled quantile at level ao(l — 8) = a (l ~~ 0(n~ 7 )). 
For the choice of /, let us choose Bonferroni's threshold (10) at level ot\ = 
(q — ao) = 0(ra~ 7 ) so that the overall probability control in (22) is really 
at the target level a. Then / Bon f(Y) < e((log(ETn 7 )/n) 1 / 2 ) and, using (23), 
we conclude that the remainder term is bounded by 0(log(A'n 7 )/n). This 
is indeed a remainder term with respect to the main term which is of order 
at least 0(n -1//2 ) as n grows [assuming that the dimension K{n) grows 
subexponentially with n]. 

There are other possibilities for choosing /, depending on the context: the 
Bonferroni threshold can be correspondingly adapted to the non-Gaussian 
case when an upper bound on the tail of each coordinate is available. This 
still makes the remainder term directly dependent on K and a possibly more 
interesting idea is to recycle the results of Section 2 (when the data is either 
Gaussian or bounded and symmetric) and plug in the thresholds derived 
there for the function /. 

Finally, if the a priori bound on the quantiles is too coarse, it is possible 
to iterate the process and estimate smaller quantiles more accurately by 
again using resampling. Namely, by iteration of Theorem 3.2, we obtain the 
following corollary. 

Corollary 3.3. Fix J a positive integer, (aj)i=o ... j-i a finite sequence 
in (0,1) and 5 E (0,1). Consider Y, /, <p and <j> as in Theorem 3.2. The 



by 



(23) 
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following then holds: 

P^(Y-/x)> fco(1 _ 5) (^Y-Y) 

(24) + ]T 7^(1-*) & Y - Y) + 7J/CY)") 

t=i / 
j-l 

<^Q i + P(0(Y- / u)>/(Y)), 

i=0 

w/iere, /or fc > 1, 7fc := n~ k Y\i=o( 2 ~B( n , ^f) ~ n )- 

The rationale behind this result is that the sum appearing inside the 
probability in (24) should be interpreted as a series of corrective terms of 
decreasing order of magnitude because we expect the sequence 7^ to be 
sharply decreasing. From (23), this will be the case if the levels are such 
that aj 3> exp(— n). 

The conclusion is that even if the a priori available bound / on small 
quantiles is not sharp, its contribution to the threshold can be made small in 
comparison to the (more accurate) resampling terms. The counterpart to be 
paid is the loss in the level and the additional terms in the threshold; for large 
n, these terms decay very rapidly, but for small n, they may still result in a 
nonnegligible contribution; in this case, a precise tuning of the parameters 
J, (aj), 5 and / is of much more importance and also more delicate. 

At this point, we should also mention that the remainder terms given by 
Theorem 3.2 and Corollary 3.3 are certainly overestimated, even if / is very 
well chosen. This makes the theoretical thresholds slightly too conservative 
in general (particularly for small values of n). From simulations not reported 
here (see [2] and Section 4.3 below), it even appears that the remainder terms 
could be (almost) unnecessary in standard situations, even for n relatively 
small. Proving this fact rigorously in a nonasymptotic setting, possibly with 
some additional assumption on the distribution of Y, remains an open issue. 
Another interesting open problem would be to obtain a self-contained result 
based on the symmetry assumption (SA) alone [or a negative result proving 
that (SA) is not sufficient for a distribution-free result of this form]. 

3.2. Practical computation of the resampled quantile. Since the above re- 
sults use Rademacher weight vectors, the exact computation of the quantile 
q a requires, in principle, 2 n iterations and is thus too complex as n becomes 
large. Parallel to what was proposed for the concentration-based thresholds 
in Section 2.5, one can, as a first solution, consider a blockwise Rademacher 
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resampling scheme or, equivalently, applying the previous method to a block- 
averaged sample, at the cost of a (possibly substantial) loss in accuracy. 

A possibly better way to address this issue is by means of Monte Carlo 
quantile approximation, on which we now focus. Let W denote annxB 
matrix of i.i.d. Rademacher weights (independent of all other variables) and 
define 



g a (^,Y,W):=inf< 



{ j=l 



<a 



that is, q a is defined in the same way as q a j except that the true distribution 
Pw of the Rademacher weight vector is replaced by the empirical distribu- 
tion constructed from the columns of W, Pw = B~ l Ylf=i ^wj "i note that 
the strict inequality ^(Y^ w ^) > x in (20) was replaced by ^(Y^ WJ ^) > x for 
technical reasons. The following result then holds. 



Proposition 3.4. Consider the same conditions as in Theorem 3.2, 
except that the function f can now be a function of both Y and W. We 
have 

Py,w(<A(Y -fj)> q ao{1 . s) Y - Y, W) + 7(W, a 5)f(Y, W)) 
< 5 + Py,w(0(Y - fi) > f(Y, W)), 

where 5o := g+i < ao + sqrr and~/(W,r]) := max{y > 0\^^2f =1 1{|W : '| > 
y} > v} i s th e (1 — rj)-quantile of \W\ under the empirical distribution Pw- 

Note that, for practical purposes, we can choose /(W,Y) to depend on 
Y only and use another type of bound to control the last term on the right- 
hand side, as in the earlier discussion. The above result tells us that if, in 
Theorem 3.2, we replace the true quantile by an empirical quantile based on 
B i.i.d. weight vectors and the factor 71 is similarly replaced by an empirical 
quantile of \ W\ , then we lose at most (B + 1)" 1 in the corresponding covering 
probability. Furthermore, it can easily be seen that if «o is taken to be a 
positive multiple of (B + then there is no loss in the final covering 

probability (i.e., 5q = ao). 



4. Discussion and concluding remarks. 



4.1. Estimating \\cr\\ p . In the concentration approach and in the Gaus- 
sian case, the derived thresholds depend explicitly on the £ p -norm of the 
vector of standard deviations a = {ok)k (an upper bound on this quantity 
can also be used). While we have left aside the problem of determining this 
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parameter if no prior information is available, it is possible to estimate a by 
its empirical counterpart 



Interestingly, the quantity ||(x||p enjoys the same type of concentration prop- 

erty as the resampled expectations considered in Section 2.3 so that we can 
derive, by a similar argument, a dimension-free confidence bound for 
as follows. 

Proposition 4.1. Assume thatY satisfies (GA). Then, with probability 
at least 1 — 5, 



It can easily be checked via Stirling's formula that C n = 1 — 0{n~ l ), so 
replacing ||o"|| p by the above upper bound does not make the corresponding 
thresholds significantly more conservative. 

A similar question holds for the parameter M in the bounded case. In 
practical applications, an absolute bound on the possible data values is often 
known (e.g., from physical or biological constraints). It can also be estimated, 
but it seems harder to obtain a rigorous nonasymptotic control on the level 
of the resulting threshold in the general bounded case. 

A different, and potentially more important, problem arises if the vector 
of variances a is not constant. Since the confidence regions proposed in this 
paper are isotropic, they will — inevitably — tend to be conservative when 
the variances of the coordinates are very different. The standard way to ad- 
dress this issue is to consider studentized data. While this would solve this 
heteroscedasticity issue, it also renders void the assumption of independent 
data points — a crucial assumption in all of our proofs. Therefore, general- 
izing our results to studentized observations is an important, but probably 
challenging, direction for future work. 

4.2. Interpretation and use of <f>- confidence regions. We have built high- 
dimensional confidence regions taking the form of "c/>-balls" [where (j) can 
be any £ p -norm. with p>l, but more general choices are possible, such 
as 4>(x) = sup fc (:cfc)_|_]. Such confidence regions in very high dimension are 
certainly quite difficult to visualize and one can ask how they are to be 
interpreted. In our opinion, the most intuitive and interesting interpretation 





(25) 
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again comes from learning theory, by regarding as a type of loss function. 
In this sense, a (^-confidence region is an upper confidence bound on some 
relevant loss measure of the estimator Y to the target [i. Additionally, in the 
particular case when (j) = sup k (xk)+ or (f) — || • 1 1 005 the corresponding regions 
can be interpreted as simultaneous confidence intervals over all coordinate 
means. 

The results presented here can also provide confidence intervals for the 
(P-risk (i.e., the averaged </>-loss) for the estimator Y of the mean vector 11. 
Indeed, combining (12) and Proposition 2.5(h), we derive that for a Gaussian 
sample Y and any p £ [1, 00], the upper bound 

(26) E |,Y- A <M»fcl + ^ T - 1(a/2) , 

r>w tidy/ 

holds with probability at least 1 — a and a similar lower bound holds. It is 
worth noting that the rate Cw / ( n Bw) is close to n _1 for most of the weights, 
meaning that resampling provides a much better estimate of E||Y — fi\\ p 
than ||Y — ri\\ p itself. This stabilization effect of resampling is well known in 
standard asymptotic settings (see, e.g., [15]). 

The ^ p -risk is also related to the leave-one-out estimation of the prediction 
risk. Indeed, consider using Y for predicting a new data point Y n+1 ~ Y 1 
[independent of Y = (Y 1 , . . . , Y n )]. The corresponding ^-prediction risk is 
given by E||Y — Y n+1 || p . In the Gaussian setting, this prediction risk is 
proportional to the F-risk: E||Y - fi\\ p = (n + l^EHY - Y n+1 || p , so the 
previous resampling estimator of the ^ p -risk also leads to an estimator of 
the prediction risk. In particular, using leave-one-out weights and denoting 
by Y( -4 ), the mean of the (Y J , j 7^ i, 1 <j < n), our results prove that the 
leave-one-out estimator 

i=i 

correctly estimates the prediction risk [up to the factor (1 — 1/n 2 ) 1 / 2 ~ 1]. 

Finally, another important field of application is hypothesis testing. When 
4> = sup fc (xfc) + or (j) = || ■ 1 1^, the thresholds derived here can be used to de- 
rive multiple testing procedures for the value of the mean of each coordinate. 
This question is extensively developed in the companion paper [2]. It is also 
possible to take advantage of the generality of our results, where <j) is allowed 
to be any ^ p -norm with p > 1 , for single global hypothesis testing. The con- 
fidence regions can be used straightforwardly to test several single global 
hypotheses, such as /j, = n* against — n*\\ p > R > 0. Depending on partic- 
ular features of the problem, having the choice between different functions 
(j) allows us to take into account specific forms of alternative hypotheses in 
the construction of the threshold. 
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4.3. Simulation study. In the companion paper [2] (Section 4), a simu- 
lation study compares the thresholds built in this paper and Bonferroni's 
threshold, using (p — || • ||oo> considering Gaussian data with different lev- 
els of correlations and assuming the coordinate variance a to be constant 
and known. Without entering into details, its general conclusions fol- 
lows. First, all of the thresholds proposed in the present paper can improve 
on Bonferroni's when the correlations are strong enough. Even though our 
thresholds are seen to be more conservative than the "ideal" one (i.e., the 
true quantile), they all exhibit adaptivity to the correlations, as expected 
from their construction. However, when the vector coordinates are close to 
being independent, the proposed thresholds are somewhat more conservative 
than Bonferroni's (the latter being essentially optimal in this case). 

The second observation made on the simulations is that the quantile ap- 
proach generally appears to be less conservative than the concentration ap- 
proach. However, the remaining advantage of the concentration approach 
is that it can be combined with Bonferroni's threshold (using Proposition 
2.2) so that one can almost take "the less conservative of the two" and only 
suffer a negligible loss if the Bonferroni threshold turns out to be better. 
Also, recall that the concentration threshold can be of use for the remainder 
terms of the quantile threshold. 

Finally, we also tested the resampled quantile without remainder term 
(i.e., taking the raw resampled quantile of the empirically centered data 
at the desired level, without modification). Although this threshold is not 
theoretically justified in the present work, it appeared to be very close to the 
ideal threshold in the performed simulations. This supports the conjecture 
that the remainder terms in the theoretical threshold could either be made 
significantly smaller or, possibly, even completely dropped in some cases. 

4.4. Comparing nonasymptotic and asymptotic approaches. Although sim- 
ulations have shown that the various thresholds proposed here can outper- 
form Bonferroni's when significant correlations are present, we have also no- 
ticed that these thresholds are generally noticeably more conservative than 
the ideal ones (the true quantiles), especially for small values of n. Moreover, 
taking into account other sources of error such as the estimation of \\cr\\p as 
above, or Monte Carlo approximations, will result in even more conservative 
thresholds. The main reason for this additional conservativeness is that our 
control on the level is nonasymptotic, that is, valid for every fixed K and n. In 
this sense, it would be somewhat unfair to compare the thresholds proposed 
here to those of "traditional" resampling theory that are only proved to be 
valid asymptotically in n and for fixed K. The nonasymptotic results derived 
here can nevertheless also be used for an asymptotic analysis, in a setting 
where K (n) is a function of n, and possibly rapidly (say, exponentially) 
growing. This type of situation seems to have been only scarcely touched 
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by existing asymptotic approaches. In this sense, in practical situations, we 
can envision "cheating" somewhat and replacing the theoretical thresholds 
by their leading component [under some mild assumptions on the growth of 
K{n)\ as n tends to infinity. Prom a theoretical point of view, an interesting 
avenue for future endeavors is to prove that the thresholds considered here, 
while certainly not second order correct, are at least asymptotically optimal 
under various dependency conditions. 



5. Proofs. 



5.1. Confidence regions using concentration. In this section, we prove 
all of the statements of Section 2 except computations of resampling weight 
constants (made in Section 5.3). 



5.1.1. Comparison in expectation. 



Proof of Proposition 2.3. Denoting by I] the common covariance 
matrix of the Y\ we have V(Y( W ~W)\W) = Af(0, (n" 1 £" =1 (Wj -W) 2 ^' 1 ^) 
and the result follows because T>(Y — fi) = AA(0, and cj) is positive ho- 

mogeneous. This last assumption is, of course, unnecessary if it holds that 



Proof of Proposition 2.4. By independence between W and Y, 
exchangeability of W and the positive homogeneity of <fi, for every realization 
of Y, we have 



A w <f)(Y -fi) = <f> 
Then, by convexity of </>, 

A W (f>(Y-fi) <E 



(e I^|Wi-W|(Y*-/x)|Y ^. 



i=l 



We integrate with respect to Y and use the symmetry of the Y* with 
respect to [i and, again, the independence between W and Y to show, finally, 
that 



A w E[(j){Y - n)\ < E 



E 



' n N 

I£|Wi-W|CY*- M ) 



i=l 



-£(Wi-W)(Y*-M) 



i=l 



:E[0(Y< W - W >)]. 
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The point (ii) is proved via the following chain of inequalities: 

E[</>(Y< W - W >)] <E 



+ E 



E 



+ E 



n 



' i=l 

i=i ) 

i n 

-Y,\x -W\(Y*-v) 

n ^ — ^ 



i=l 



< (a + E\W - x \)E[cP(Y - fi)\. 

In the second line, we used, as before, the symmetry of the Y* with respect 
to /i, together with the independence of W and Y. In the last inequality, 
we used the assumption \W{ — xq\ = a a.s. and the positive homogeneity of 
0. □ 

5.1.2. Concentration inequalities. 

Proof of Proposition 2.5. Here, we use concentration principles fol- 
lowing closely the approach in [24], Section 3.2.4. The essential ingredient 
is the Gaussian concentration theorem of Cirel'son, Ibragimov and Sudakov 
([7] and recalled in [24], Theorem 3.8), stating that if F is a Lipschitz func- 
tion on M N with constant L, then, for the standard Gaussian measure on 
R N , we have P(F > E[F] + t) < 2$(t/L). 

Let us denote by A a square root of the common covariance matrix of 
the Y\ If £j is a i'f-dimensional, standard normal vector, then A(j has the 
same distribution as Y*-/i. For all ( G (R K ) n , we let 7i(C) := <f>(^ E"=l Mi) 
and T 2 (C) := E[</>(± X*=iW ~ W) A Ci)]- If we endow (R K ) n with the stan- 
dard Gaussian measure, then T\ (resp., T%) has the same distribution as 
0(Y-/x) [resp., 0(Y< W - W >)]. 

From the Gaussian concentration theorem recalled above, in order to reach 
the conclusion, we therefore only need to establish that T\ (resp., T^) is a 
Lipschitz function with constant ||c|Ip/v^ (resp., ||cr|| p Cw/n) with respect 



to the Euclidean norm || • \\2,Kn on (R ) n . Let G (K ) n and denote by 
( a k)i<k<K the rows of A. Using the fact that <f> is 1-Lipschitz with respect 
to the ^p-norm (because it is subadditive and bounded by the £ p -norm), we 
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get 



i=l v V \ »=1 



~E(6-£) 



For each coordinate A;, by the Cauchy-Schwarz inequality and since ||oft||2 
we deduce that 

/ -. n \ -. n 

n / n 



i=i 



i=l 



Therefore, we get 

|ri(C)-ri(OI<IH| P 



i n 

-EK.-o 



1=1 



^^IK-C'lk 



Kni 



using the convexity of x £ M. K i— >■ ||x||2, and we obtain (i). For T2, we use the 
same method as for Ti to obtain 



|r 2 (c)-r 2 (OI<lkll P E 



(27) 



1 n 



i=l 



< 



cr „ 



n 



E 



i=i 



Note that since (£™ =1 (W; ~ W )? = °_i_ we have E ( W i ~ ^0(^2 - W) = 
-Cyy/n. We now develop lEILiC^* — W)(Ci — CD Hi m tne Euclidean space 



E 



i=i 



r 2 



c^a - o E lie* - m - - & - Cj 



1=1 



r 2 

n 



i=l 



Consequently, 



(28) E 



E(^-^)(6-CD 



i=l 



<c^J2\\Ci-cl\\t=cUc-c'\\i K n- 



i=l 



Combining expression (27) and (28), we find that T 2 is 1 1 cr 1 1 p Cvi/ / n-Lipschitz . 
□ 
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Remark 5.1. The proof of Proposition 2.5 is still valid under the weaker 
assumption (instead of exchangeability of W) that E[(Wi — T4 / )(W r j — W)\ 
can only take two possible values, depending on whether or not i = j. 



5.1.3. Main results. 



Proof of Theorem 2.1. The case (BA) (p, M) and (SA) is obtained 
by combining Propositions 2.4 and 2.6. The (GA) case is a straightforward 
consequence of Proposition 2.3 and the proof of Proposition 2.5 (considering 
the Lipschitz function T\ — T2). □ 

Proof of Proposition 2.2. From Proposition 2.5(i), with probability 
at least 1 — a(l — 5), <p(Y — /i) is less than or equal to the minimum of i Q (i_5) 

and E[<^>(Y — fj,)] + — ^ 1 *^/ 2 ) (since both of these thresholds are deter- 
ministic). In addition, Propositions 2.3 and 2.5(h) give that with probability 
at least 1-aS, E[</>(Y-/j)] < E w[H^- w) )\ + hhC^ §- x ( a5 / 2 ). The result 
follows by combining the last two expressions. □ 



5.1.4. Monte Carlo approximation. 



Proof of Proposition 2.7. The idea of the proof is to apply McDi- 
armid's inequality (see [25]) conditionally on Y. For any realizations W and 
W' of the resampling weight vector and any v £ M fc , we have 



< 



C2 - Cl 



?? 



since (j) is subadditive, bounded by the £p-norm and Wi — W£ [01,02] a.s. 

The sample Y being deterministic, we can take equal to a median 
of (Y|)i<j< n . Since W 1 , . . . ,W B are independent, McDiarmid's inequality 
gives (19). □ 



5.1.5. Estimation of the variance. 



Proof of Proposition 4.1. We use the same notation and approach 
based on Gaussian concentration as in the proof of Proposition 2.5. Writing 
Y* — fj, = AQ, we upper bound the Lipschitz constant of \\cr\\p as a function 
of C = (Cl, • • • , Cn): given C, C G (^) n , we have 

P(OII P -ROII P <P(C)-^OII P 
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< 



< 



(£X>.(&-a-«-?)> 2> ) 

V t=i / k 

^fEii(a-c)-(cr 




We then additionally have 

n n 

Eii^-c)-(£-c')iil=E^-£i 



HIC-CllBllC-C'll 2 . 



allowing us to conclude that ||5(C)||p has Lipschitz constant Concerning 
the expectation, observe that for each coordinate k, the variable ^fnSkl^k 
has the same distribution as the square root of a x 2 ( n — 1) variable. Elemen- 
tary calculations for the expectation of such a variable lead to E[<7fc] = C n <7fc. 
We finally conclude that with probability at least 1 — 5, the following inequal- 
ity holds: 



Km 



CUM 



||E[S]|| p <E[||Sy <\\a\\ p + tk^f 6 



Solving this inequality in ||<t|| p yields the result. □ 

5.2. Quantiles. Recall the following inequality coming from the defini- 
tion of the quantile q a : for any fixed Y, 

(29) Pff^fYW) > ?a (0, Y)) < a < Fwi^Y^) > q a (<p,Y)). 

Proof of Lemma 3.1. We introduce the notation Y»W = Y-diag(VF) 
for the matrix obtained by multiplying the ith column of Y by Wj, i = 
1, . . . , n. We then have 



(30) 



V(^(Y- / u)>g a (0,Y-/x)) 

= E w [F Y (<j ) ((Y^JI)W) > q a (<f>, (Y - M ) . VF))] 
= E Y [P W (<K(Y^7I) W ) > q a {4>,Y - /i))] < a. 



The first equality is due to the fact that the distribution of Y satisfies 
assumption (SA), hence the distribution of (Y — y) is invariant under mul- 
tiplying by (arbitrary) signs W S { — l,l} n . In the second equality, we used 
Fubini's theorem and the fact that for any arbitrary signs W, as above, 
q a ((f>, (Y — (J,) • W) = q a ((j),Y — y). Finally, the last inequality follows from 
(29). □ 
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Proof of Theorem 3.2. Write 71 = 7i(ao<$) for short and define the 
event 

£ := {Y\q ao ((f),Y — fi) < g ao(1 _ 5) (<£, Y - Y) + 71/OO}. 
We then have, using (30), 

P(0(Y - fi) > q ao(1 - 5) (<t>, Y - Y) + 7 i/(Y)) 
(31) < n<t>(Y -y)> g ao (4>, Y — fi)) + P(Y G £ c ) 

<a + P(YG^). 

We now concentrate on the event £ c . Using the subadditivity of (f> and 
the fact that (Y~^7J)W = (Y- Y)< w > + W(Y-/z), we have, for any fixed 

qo < P W (0((Y^)W) > 9ao (0, Y - //)) 

< P W (^((Y^)W) > feod-i)^, Y - Y) + 7l /(Y)) 



< P W (0((Y- Y)W) > gao( i_i)(^ Y- Y)) 
+ P W (0(F(Y- M ))> 71 /(Y)) 

< oo(l - 5) + P^G^Y - //)) > 7i/(Y)). 

For the first and last inequalities, we have used (29) and for the second 
inequality, the definition of £ c . From this, we deduce that 

£ c C {Y\F W (^(W(Y - fi)) > 71/OO) > a 5}. 

Now, using the positive homogeneity of (j> and the fact that both <fi and / 
are nonnegative, we have 

F w (<f>(W(Y-fi))> 7l f(Y)) 

7i/O0 



= F w i^\W\ > 
<¥ w ( \W\ > 



i>(sign(W)(Y - fi)) 
7i/(Y) 



(Y — fi) 

= 2F B J l - { 2B n -n)>J0n_ 
\n 4>{Y-ft) 

where B n denotes a Binomial (n, 5) variable (independent of Y). From the 

last two displays and the definition of 71, we conclude that £ c C {Y|</>(Y — 
fi) > /(Y)}, which, substituted back into (31), leads to the desired conclu- 
sion. □ 
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Proof of Corollary 3.3. Define the function 

g (Y) = q {1 _ s)ao (</>, Y-Y)+\J2 7*9(1-*)^ Y — Y) + 7J /(Y) J 
and, for A; = 1, . . . , J, 

5fc (Y) = 7i?(i-*)a« Y - Y) + 7 J/(Y) j 

with the convention that gj = /. For < k < J — 1, applying Theorem 3.2 
with the function gt+\ yields the relation 

F w (<X Y - m) > <?fc ( Y) ) < a k + P w (</>( Y - M ) > fljt+i ( Y) ) . 
Therefore, we get 

J-i 

P W (0(Y - //) > ff o(Y)) < £ oh + P(0(Y - A*) > /(Y)) 
as announced. □ 

Proof of Proposition 3.4. Let us first prove that an analog of Lemma 
3.1 holds with q ao replaced by q ao . First, we have 

E w Py(0(Y -h)> q ao (0, Y - (i, W)) 

= E^E w Py(0((Y~^)< m/ '>) > g Q0 (^ (Y - fi) . W)) 

= E Y Pw,iy (<X( Y - ^)^ ) > q ao (</>, Y - //, W • W)), 

where W denotes a Rademacher vector independent of all other random 
variables and W • W = diag(W') • W denotes the matrix obtained by mul- 
tiplying the ith row of W by W[ , i = 1, . . . , n. Note that (W , W • W) ~ 
(W',W). Therefore, by definition of the quantile q ao , the latter quantity is 
equal to 

EyPw.Hr, (!El{^((Y^)< WJ >) > 0((Y^)^'>)} < « ) < 

where the last step comes from Lemma 5.2 (see below). 

The rest of the proof is similar to the one of Theorem 3.2, where P^y 
is replaced by the empirical distribution based on W, P\y = 5 Ylf=i^wi • 
Thus, (29) becomes, for any fixed Y, W, 

P W [0(Y W ) > q ao (<p, Y, W)] <a < ¥ w [<f>(Y^) > q ao (0, Y, W)] . 
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The role of £ is then taken by 

£ := {Y, W\q ao (<f>, Y - H, W) < q ao{1 ^) (<P, Y - Y, W) + T /(Y, W)}, 
where we write 7 = 7(W,ao<5) for short. We then have, similarly to (31), 

Py,wMT-A*) > 9ao(i-i)(^Y-Y)+ 7 /(Y, W)) < L^oJ+l +p ^ w( g C) 

and follow the proof of Theorem 3.2 further, we obtain 

7 /(Y,W 



£ C C |Y,W 
which gives the result. □ 



Pw 



\W\> 



0(Y-/i) 



We have used the following lemma which essentially reproduces Lemma 
1 of [30] , with a minor strengthening. While the proof was left to the reader 
in [30], because it was considered either elementary or common knowledge, 
we include a succinct proof below for completeness. 

Lemma 5.2 (Minor variation of Lemma 1 of [30]). Let Zq, Zi,...,Zb 
be exchangeable real-valued random variables. Then, for all a E (0, 1), 



( 1 r , \ \Ba\ + 1 

^-yj 1{Zj >z o} < a j<L±_ 



1 

<a + 



B + l 



The first inequality becomes an equality if Zi 7^ Zj a.s. For example, it is the 
case if the Zi 's are i.i.d. variables from a distribution without atoms. 

Proof. Let U denote a random variable uniformly distributed in {0, . . . , B} 
and independent of the Zi's. We then have 

= P Hfy > z o} < Ba + 

= P C7 P (Zi) HZj > Zu] < Ba + 1^ 

= (Jt ^ Z "} * l*«J +lj< 

Note that the last inequality is an equality if the Z^s are a.s. distinct. □ 
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5.3. Exchangeable resampling computations. In this section, we compute 
constants Aw, Bw, Cw and Dw [defined by (3) to (6)] for some exchange- 
able resamplings. This implies all of the statements in Table 1. We first 
define several additional exchangeable resampling weights (normalized so 
that E[Wi] = 1): 

• Bernoulli{p), p€ (0, 1): pWi i.i.d. with a Bernoulli distribution of param- 
eter p. A classical choice is p = \ . 

• Efron(q), q E {1, . . . ,n}: qn~ l W has a multinomial distribution with pa- 
rameters (q; n _1 , . . . , n _1 ). A classical choice is q = n. 

• Poisson(n), fi E (0,+oo): [iWi i.i.d. with a Poisson distribution of param- 
eter fi. A classical choice is fj, = 1. 

Note that ~y{ w ~ w ) an d all of the resampling constants are invariant under 
translation of the weights so that Bernoulli(l/2) weights are completely 
equivalent to Rademacher weights in this paper. 

Lemma 5.3. 1. Let W be Bernoulli{p) weights with p E (0,1). We then 
have 2(1 - p)(l -±) = A W <B W < ^^l^Jl^, C w = S J±^\ and 

D w < ^ + 1^ - 1| + \f^-- 

2. Let W be Efron(q) weights with q E {1, . . . , n}. We then have 2(1 — = 
A w < B w < ^± and C w = • 

3. Let W be Poisson(n) weights with fi > 0. We then have Aw < Bw < 

- \ and C w = ^p- Moreover, if fi = l, we get \ - ^ < A w . 

4. Let W be Random hold-out(q) weights with q E {1, . . . ,n}. We then have 
A w = 2{l-l) ; Bw = JP^,C w = ^jpri andDw= n +]1 _ 

2q\- 

Proof. We consider the following cases: 

General case. First, we only assume that W is exchangeable. Then, from 
the concavity of \f- and the triangular inequality, we have 



ElWi - E[Wi]| - \ M(W - E[WM) 2 

(32) 



/n — 1 

< ElWi - E[Wi]| -E\W- E[Wi]\ <A W <B W < \ C w . 

V n 

Independent weights. When we suppose that the W{ are i.i.d., we get 



(33) ElWi - E\W X \\ - V VaJ W ) < Aw and Cw = v / V ar(Wi). 

In 



32 



S. ARLOT, G. BLANCHARD AND E. ROQUAIN 



Bernoulli. First, we have A w = E|Wi - W\ = E|(1 - ±)Wi - X n J with 
-Xn,p := ^(^2 + • • • + W^). Since Wi and X ntP are independent and X n ^ p G 
[0, (n — l)/(np)] a.s., we obtain 



,4^ = pE 



n J p 



X, 



n.p 



+ (l-p)E[X n , i 



1-- + 

re 



1 - 2p)E[X n , ; 



The formula for follows since E[A" njP ] = (n — l)/n. Second, note that 
the Bernoulli(p) weights are i.i.d. with Vav(W 1 )=p- 1 - 1, E[Wi] = 1 and 
E\Wi - 1| =p{p~ 1 - 1) + (1 -p) = 2(1 -p). Hence, (32) and (33) lead to 
the bounds for Bw and Cyy- Finally, the Bernoulli(p) weights satisfy the 
assumption of (6) with xq = a = (2p)~ l . Then 



D 



W 



1 

27 



+ E 



W 



1 

27 



< 



1 

2p 



1 



1 



+ EIW-1I 



1 1 

Ip p 



2~ P 



+ 



P 



up 



Efron. We have W = 1 a.s. so that C\y = y x Var(Wi) = ^Jnjq. If, 
moreover, q < n, then Wi < 1 implies that VFj = and Aw = E\W\ — 1| = 
E[Wi - 1 + 21{W! = 0}] = 2P(Wi = 0) = 2(1 - i) 9 . The result follows from 
(32). 

Poisson. These weights are i.i.d. with Var(IUi) = /z -1 , E[Wi] = 1. More- 
over, if (j, < 1, TU; < 1 implies that = and E| W\ - 1| = 2P(TUi = 0) = 
2e - ^. With (32) and (33), the result follows. 

Random hold-out. These weights are such that {Wj}i<j< n takes only 
two values, with W = 1. Then Aw, Bw and Cw can be directly computed. 
Moreover, they satisfy the assumption of (6) with x$ = a = n/(2q). The 
computation of Dw is straightforward. □ 
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SOME NONASYMPTOTIC RESULTS ON RESAMPLING IN HIGH 
DIMENSION, II: MULTIPLE TESTS 1 

q | By Sylvain Arlot, Gilles Blanchard 2 and Etienne Roquain 

(N ■ 

CNRS ENS, Weierstrass Institute Berlin and UPMC University of Paris 6 

Ch ■ 

i CS • In the context of correlated multiple tests, we aim to nonasymp- 

totically control the family-wise error rate (FWER) using resampling- 
type procedures. We observe repeated realizations of a Gaussian ran- 
dom vector in possibly high dimension and with an unknown co- 
variance matrix, and consider the one- and two-sided multiple test- 
■ ing problem for the mean values of its coordinates. We address this 

| problem by using the confidence regions developed in the compan- 

ion paper [Ann. Statist. (2009), to appear], which lead directly to 
single-step procedures; these can then be improved using step-down 
d , algorithms, following an established general methodology laid down 

by Romano and Wolf [J. Amer. Statist. Assoc. 100 (2005) 94-108]. 
This gives rise to several different procedures, whose performances 
are compared using simulated data. 

m 
> . 

. 1. Introduction. 

r- 

' 1.1. Framework and motivations. We consider a sample Y := (Y , . . . , Y n ) 

£NJ ■ of n > 2 i.i.d. observations of a Gaussian vector with dimensionality K, pos- 

sibly much larger than n. The common covariance matrix of the Y* is not 
assumed to be known in advance. We investigate the two following multiple 
testing problems for the common mean \i £ of the Y* : 



o 



k> I • One-sided. Test simultaneously iT& : "/Ufc < 0" against : "/Ufc > 0" for 1 < 

"S- k<K; 
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• Two-sided. Test simultaneously : = 0" against Ak : "/i^ ^ 0" for 1 < 

For simplicity, we introduce the following notation to cover both cases: 

, . test simultaneously : = 0" against A\. : 7^ 0" 

W for l<k<K, 

where, for x € R, [a;] denotes either max{x, 0} = x + in the one-sided context 
or \x\ in the two-sided context. 

In this paper, we tackle the problem (1) by building multiple testing 
procedures which control the family- wise error rate (FWER) . We emphasize 
that: 

• we aim to obtain a nonasymptotic control, valid for any fixed K and 
n, and, in particular, with K possibly much larger than the number of 
observations n; 

• we do not want to make any particular prior assumption on the structure 
of the covariance matrix of the Y* . 

As explained in [1], this point of view is motivated by some practical ap- 
plications, especially neuroimaging [5, 10, 11]. Multiple testing problems in 
this field typically have parameters 10 4 < K < 10 7 , n < 100, with strong and 
complex dependencies between the coordinates of Y\ Another motivating 
example is microarray data analysis (see, e.g., [8]). 

1.2. Goals. In this work, we consider thresholding-based procedures which 
reject the null hypotheses for indices k £ R a (Y) C H := {1, . . . , K} cor- 
responding to large values of [Y^], where Y^ = n^ 1 J21=i^k denotes the 
vector of empirical means, that is, 

(2) J R a (Y) = {l<A:<^|[Y fe ]>t a (Y)}, 

where t a (Y) is a possibly data-dependent threshold. 

The type I error of such a multiple testing procedure is measured here by 
the family- wise error rate (FWER), defined as the probability that at least 
one hypothesis is wrongly rejected: 

FWER(i? Q ) := P(i? a (Y) n H ^ 0), 

where T~Lq := = 0} is the set of coordinates corresponding to the 

true null hypotheses. The choice of this error rate is discussed in Section 
5.1. Given a level a G (0,1), the goal is now to build a multiple testing 
procedure R a such that FWER(i? Q ) < a is valid for all distributions in 
the family being considered (i.e., Gaussian with arbitrary mean vector and 
covariance matrix); furthermore, as many false hypotheses as possible should 
be rejected. 
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To this end, we use the family of (1 — a)-resampling-based confidence 
regions for fi introduced in the companion paper [1]. Of interest here are 
regions taking the following form: for some subset C C H, 

(3) G(Y, l-a,C) := {x € R K \ sup[Y fc - Xfc ] < t a (Y,C)}, 

where t a is a data-dependent threshold built using a resampling principle. 
Several possible choices for this threshold were proposed. The main results 
of [1], as well as the link between confidence regions (3) and (single-step) 
multiple tests for (1), are briefly recalled in Section 2. 

1.3. Contribution in relation to previous work. Most of the existing resam- 
pling-type multiple testing procedures have been developed in an asymptotic 
framework (see, e.g., [8, 13, 17-19]), while our present goal is to study proce- 
dures that have nonasymptotic theoretical validity (for any K and n). The 
main classical alternative approach to asymptotic validity is to use an in- 
variance of the null distribution under a group of transformations — that is, 
exact randomized tests [14-16] (the underlying idea can be traced back to 
Fisher's permutation test [7]). Additionally, and as explained in [16], exact 
tests can be combined with a step-down algorithm to build less conserva- 
tive procedures while preserving the same nonasymptotic control on their 
FWER (also, see [17] for a generalization to the fc-FWER). 

In the case considered here, Gaussian vectors Y l have a symmetric dis- 
tribution around their mean so that the action of mirroring any subset of 
the vectors in the data sample with respect to their mean constitutes such a 
group of distribution-preserving transformations. In the two-sided case, this 
group is known under the global null hypothesis fx = and just corresponds 
to arbitrary sign reversal of each data vector. 

Consequently, it is possible to directly derive from [16] a step-down pro- 
cedure whose FWER is controlled in a nonasymptotic setting (see Section 
3) . This approach will be referred to as uncentered in this paper because the 
sign reversal is applied to the (Y l )i<j< n themselves, without prior centering. 

We observe that the principle of sign reversal was also used in [6] in 
order to build an adaptive (single) test for zero mean under the assumption 
of symmetric and independent errors. The setting studied here is different 
since we consider multiple testing with possibly dependent errors. 

Compared to this uncentered approach, most of the procedures proposed 
in this paper consist of applying the sign reversal to the empirically centered 
data (Y l — Y)i<j< n . It was proven in [1] that such an intuitive idea is 
theoretically valid, despite the dependencies between the Y l — Y, 1 < i < n, 
at the cost of adding a second order remainder term. We argue in the present 
paper that in some interesting situations, the prior centering operation leads 
to a noticeable decrease of the computation time of the step-down algorithm, 
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up to some small loss in accuracy (due to the remainder term) with respect 
to the uncentered step-down. Additionally, the centered approach can be 
used both in the one-sided and two-sided contexts, while the uncentered 
approach has, to the best of our knowledge, only been proven valid in the 
two-sided case. 

1.4. Notation. Let us now introduce some notation that will be used 
throughout this paper. 

• Y denotes the K x n data matrix {Y l ^)\<k<K ,i<i<n- A superscript index 
such as Y* indicates the iih column of a matrix. The empirical mean 
vector is Y := ^ X^=i^™- If M ^ R^> Y — fi is the matrix obtained by 
subtracting /i from each (column) vector of Y. 

• The vector a := {o~k)i<k<K is the vector of the standard deviations of 
the data: Vfc, 1 <k<K, a k := Var 1 / 2 (Y^). For C C H, we also define 
Ikllc :=sup fc6C (7 fe . 

• $ is the standard Gaussian upper tail function: if X ~ A/"(0, 1), then Vx G 
R, ®(x)=F(X>x). 

• If W G W 1 , we define the mean of W G W 1 as W := ± Ya=i w i and for 
every c£K, W — c:= (Wi — c)i<;<„ G R n . 

• For a subset C CH, \C\ denotes the cardinality of C. 

2. Single-step procedures using resampling-based thresholds. 

2.1. Connection between confidence regions and FWER control. We start 
with recalling a simple device linking confidence regions to FWER control 
in multiple testing. In a nutshell, the idea is that a confidence region of 
the form (3) directly gives a multiple testing procedure R with controlled 
FWER when taking C = Hq. Since Ho is not known in advance, we actually 
need a confidence region (3) defined for every CCH and satisfying certain 
properties. 

More formally, let a G (0,1) be fixed and T a = (i Q (Y,C),C C H,Y G 
M. Kxn ) be a family of thresholds indexed by subsets C C %. We consider 
threshold families satisfying the two following key properties. First, t a (Y, Hq) 
is a 1 — a confidence bound on the deviations of sup fc6 % [YJ: 

(CB Q ) p(sup[Y fc ]<t Q (Y,^ )) >!-«• 

Second, T a is nondecreasing w.r.t. C, that is, 

(ND) VYgM a ',VC C'cH,CcC => t a (Y,C) <t a {Y,C). 

We now define a single-step multiple testing procedure and establish its 
FWER control. 
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Proposition 2.1. Define the single-step multiple testing procedure as- 
sociated with 7~ a as the procedure rejecting the set of hypotheses given by 

(4) {ken\{Y k ]>t a (Y,H)}. 

If the threshold family satisfies (CB a ) and (NT)), then the FWER of the 
associated single-step procedure is controlled at level a. 

Proof. We first use (ND), then (CB a ): 

F(3k\{Y k ]>t a (Y,n) and \ji k ] = 0) 



□ 



S up[Y k ]>t a (Y,n) 

k&Ho 

<p(sup[Y fc ]>t a (Y,^ )) <a. 

V fee-H ' 



Note that the single-step procedure only uses the value of the largest 
threshold among the t a (Y,C), C C %. In Section 3, we use the iterative step- 
down principle to improve the procedure by making use of the thresholds 
t a (Y,C) for some smaller CcH. 

The condition (CB a ) is, in particular, satisfied whenever, for any C C H, 
t(Y,C) provides a 1 — a confidence region of the form (3) for (fJ,k)keC- We 
use this idea next to derive testing thresholds from the confidence regions 
constructed in [1]. 

2.2. Resampling thresholds. We first give a compact recapitulation of 
resampling-based thresholds introduced in [1] and used to build confidence 
regions for the mean of a high-dimensional, correlated vector. This is in- 
tended as a single overall reference for all of the thresholds that we use in 
the present paper. Here, and in the following, W E K n denotes a random 
vector independent from the data Y, called the resampling weight vector. 
Moreover, in order to simplify the results of [1], we specifically assume that 
the Wi are i.i.d. Rademacher random variables, that is, that they satisfy 
W(Wi = 1) = F(Wi = -1) = 1/2. As first building blocks, define the two fol- 
lowing resampling quantities, the (scaled) resampled expectation and quan- 
tile: 



(5) 



£(Y,C) :=B^E 



AY 



sup 

fcec 



n 



(6) (fo(Y,C):=inf^xe: 



"w 



sup 

, fcec 



1 n 

' :1 



> x < a 



where B w := E w [(± £"=i(Wi - W)) 1 / 2 } &ndE w [] [resp., P w (-)] denotes the 
expectation (resp., probability) operator over the distribution of W only. We 
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Table 1 

Reference table for the different rejection thresholds 



(-) 



(8) 



(9) 



i Q ,Bonf(Y,C) := 4=lkllc$ *( " 



with 



c = 1 (one-sided case) 
c = 2 (two-sided case) 



i Q ,conc(Y,C) := £(Y - Y,C) + ||a|| c $ (a/2) 



1 1 



ta.concABonf 



f(Y,C) := min t Q (i-«),Bonf (Y,C), 



£(Y - Y,C) + M^^ 1 f + J^<F 1 ( ^ 

v ' ; Jn \ 2 J nB w \ 2 



(10) t;, qua nt(Y,C) := 9q (Y-Y,C) 

(11) tct ,quant + Bonf (Y,C) :=t Q!q(1 — <5), quant 

(Y,C) + 7n(ao5)t Q - QO ,Bonf(Y,C) 

(12) t Q ,quant+conc(Y,C) := t ao (1-$) jqua nt (Y, C) + J n {c<o8)t a - ao jC onc ( Y , C) 

( 13) ic* , quant .unccnt (Y,C) := 9 a(Y,C) 



also define the following function which is the upper quantile function of a 
binomial (n, i) variable: 




B(n, rfj := maxj^ k € {0, . . . , n} 

Finally, we define the factor 

_ = 2g(n,r?/2)-n < / 21og(2/r ? ) \ 1/2 
n — \ n / 

where the last inequality, intended as a more explicit formula, is obtained 
via Hoeffding's inequality. 

Table 1 gives a reference for the different rejection thresholds considered in 
this paper, depending on a target type I error level a, subset of coordinates 
C and possibly on two arbitrary parameters ao G (0, a) and 8 G (0,1). The 
threshold (7) is Bonferroni's for Gaussian variables. Thresholds (8), (9), (11) 
and (12) were introduced in [1]. More precisely, threshold (8) is based on 
a Gaussian concentration result. Threshold (9) is a compound threshold 
which is very close to the minimum of (7) and (8). Threshold (10) is a raw 
resampled quantile for the empirically centered data; it has not been proven 
theoretically that this threshold achieves the correct level (this is signalled 
by the star symbol). The thresholds (11) and (12) are based on the latter 
with an additional term which was introduced in [1] to compensate (from a 
theoretical point of view) for the optimism in centering the data empirically 
rather than using the (unknown) true mean. The thresholds (7), (8) and (9) 



RESAMPLING TESTS IN HIGH DIMENSION 



7 



[and thus (11) and (12)] depend on the quantity ||c||c; if it is unknown, a 
confidence upper bound on \\<j\\c can be built (see Section 4.1 of [1]). Finally, 
note that all of these thresholds are nondecreasing w.r.t. C, that is, they 
satisfy assumption (ND). The nonasymptotic theoretical results obtained in 
[1] in the Gaussian case can be summed up in the following theorem. 

Theorem 2.2. Ift a (Y,C) is one of the thresholds defined by (7), (8), 
(9), (11) or (12), then it holds for any C CH, in the one-sided as well as 
the two-sided setting, that 



In particular, all of these thresholds satisfy (CB a J, both in the one-sided and 
two-sided cases. 

Note that the results obtained in [1] have more generality. In particular, 
variations on the above thresholds were proposed for non-Gaussian, but 
bounded, data and weight families different from Rademacher weights can 
be used in (8) and (9). For the purposes of the present work, we restrict 
our attention to Gaussian data and Rademacher weights for simplicity. It 
is straightforward to show that (14) implies (CB a ): the two-sided case is 
obvious since fj,k = for k € Ho; the one-sided case is an easy consequence 
of the fact that the positive part is a nondecreasing function. Therefore, by 
an application of Proposition 2.1, the corresponding thresholds t a (Y,H) for 
the full set of hypotheses can be used for multiple testing in the one-sided 
as well as two-sided setting with a nonasymptotic control of the FWER. 

We mentioned above that the thresholds (11) and (12), based on a re- 
sampled quantile for the empirically centered data (Y — Y), include an 
additional term in order to upper bound the variations introduced by the 
centering operation. In the context of testing, however, it is important to 
note that the quantile for the uncentered data defined in (13) is (without 
modification) a valid threshold in the two-sided setting. 

Theorem 2.3. Assume only that Y has a symmetric distribution around 
its mean fj,, that is, that (Y 1 — /i) ~ (/x — Y 1 ). If fj,^ = for all k EC, then 
the threshold t^quant.uncent (Y, C) defined by (13) satisfies (14)- In particular, 
the threshold i a , qU ant.unccnt(Y, C) satisfies (CB a ) in the two-sided setting. 

This result can probably be considered to be well known and corresponds, 
for example, to Lemma 3.1 in [1]. Again by Proposition 2.1, the threshold 
defined by (13) can therefore be used for multiple testing (although only for 
the two-sided setting). 



(14) 
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It is useful at this point to carry out a brief qualitative comparison of the 
uncentered quantile threshold (13) and the centered quantile thresholds (11) 
and (12) (in the two-sided setting). The obvious differences between the two 
types of thresholds are: 

• the data vectors are not centered around the empirical mean Y prior to 
computing the threshold (13); 

• the centered thresholds (11) and (12) have an additional additive term 
with respect to the main resampled quantile; furthermore, the main cen- 
tered quantile is computed at a shrunk error level ao(l — 5) <a. 

The second point is a net drawback of the "centered" family compared to the 
"uncentered" one. On the other hand, empirical centering of the data has 
the advantage of making the corresponding threshold t a (Y,C) translation 
invariant, that is, for every Y £ M Kxn and x G R^, the following property 
holds: 

(TI) VCcH t a (Y + x,C)=t a (Y,C). 

This property is also shared by the concentration-based thresholds (8) and 
(9). Therefore, large values of nonzero means Hk do not affect these thresh- 
olds. To understand the practical consequences of this point, let us consider 
the following informal and qualitative argumentation. If some coordinates of 
(Y^,)fc g c have a large mean relative to the noise (i.e., a large signal-to-noise 
ratio or SNR), then the corresponding coordinates of Y will have, on aver- 
age, a large absolute value relative to the coordinates with zero mean and 
the contribution of the former to the threshold will make the uncentered 
quantile significantly larger. In contrast, the centered quantile threshold is 
translation invariant and thus unaffected by the signal itself. Hence, in this 
situation, it is likely that the centered quantile threshold will be smaller. 
This effect is illustrated in the simulations presented in Section 4. 

3. Step-down procedures. Single-step procedures can often be improved 
by iteration based on the step-down principle. Roughly, the idea is to re- 
peat the multiple testing procedure with T~L replaced by H \ i? Q (Y) and to 
iterate this process as long as new coordinates are rejected. Again, consider 
a threshold family % = {t a (Y,C),C C U, Y € R Kxn ) satisfying (CB a ) and 
(ND). 

Definition 3.1. Consider the nonincreasing sequence (Cj,j > 0) of sub- 
sets of % defined by 

C :=n and Vj > 1 Cj := {k 6 C,-_i|[Y fc ] < f (Y,C,-_i)}, 
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and let I be the stopping rule i = min{j > l\Cj = Cj-i}- Then the step-down 
multiple testing procedure associated with T a rejects the hypotheses of the 



A very general result on step-down procedures was established in [16], 
Theorem 3, which we reproduce here using our notation. 

Theorem 3.2 (Romano and Wolf [16]). Let T a be a threshold family sat- 
isfying (ND). The FWER of the step-down procedure (15) is then controlled 



v fceH ' 

Therefore, if T a additionally satisfies (CYi a ), the FWER of the associated 
step-down procedure is upper bounded by a. 

A sketch of the proof can be given as follows: assume that Y is such 
that sup fcgWo [Yfc] < t a (Y,'Ho). Then Ho C Cj_i implies that t a (Y,Cj-i) > 
t a (Y,Ho) > sup fcgWo [Yfc] and, in turn, Ho C Cj, by definition of Cj. By 
recursion, Ho is contained in Cj for every j and the step-down procedure 
therefore makes no type I error on the event {sup fe6Wo [Y&] < t a (Y ,Ho)} ■ 

A direct consequence of Theorem 3.2 is that the step-down procedures 
based on any of the thresholds considered in the previous section [defined 
by (7), (8), (9), (11), (12) or (13)] have a FWER controlled at level a [for 
(13), only in the two-sided setting]. Note that the step-down procedure based 
on Bonferroni's threshold (7) is exactly Holm's procedure [9]. 

Parallel to the discussion at the end of Section 2.2, we can carry out a 
short qualitative comparison of the step-down procedure based on the un- 
centered quantile threshold (13) and the step-down procedures based on the 
centered quantile thresholds (11) and (12) (in the two-sided setting). Again, 
if some coordinates have a large SNR, then they certainly contribute to mak- 
ing the uncentered quantile threshold significantly larger at the first step of 
the step-down procedure. This time, however, even if this first threshold is 
relatively large, it will still be able to rule out at the first step precisely 
those coordinates having the largest means. This, in turn, will result in an 
important improvement of the uncentered threshold at the second iteration, 
and so on, until all coordinates with a large SNR have been weeded out. 
Thus, the initial disadvantage of the uncentered threshold will be automat- 
ically corrected along the step-down iterations and the final threshold will 
be close to the ideal resampled quantile q a (Y,'Ho) in the last iterations. In 
contrast, the centered thresholds (11) and (12) still suffer from the loss due 




{k€H\[Y k ]>t a (y,C,)}. 



by 
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to the remainder term and level shrinkage along the step-down. In conclu- 
sion, in contrast to the single-step case, we expect the uncentered procedure 
to eventually outperform the centered ones after some step-down iterations. 
This is in accordance with the simulations of Section 4. 

At this point, it could seem that the uncentered step-down procedure 
is both simpler and more effective than the centered step-down ones and 
thus should always be preferred. However, the above discussion gives us 
another insight: the step-down procedure based on the uncentered quantile 
should require more iterations to converge because the first iterations return 
inaccurate thresholds. In order to deal with this drawback, we propose using 
the leverage of the centered quantile thresholds for the first step — weeding 
out in a single step most of coordinates having a large SNR — and then 
subsequently continuing with the uncentered threshold in the next steps for 
greater accuracy. We obtain the following new algorithm. 

Algorithm 3.3 (Hybrid approach). 

1. Compute the threshold i a ,q U ant+Bonf (Y,"H) defined by (11) with a given 
8 € (0, 1), ao € (0,a) and consider i?o, the corresponding single-step pro- 
cedure (4). 

2. If Rq = ~H, then stop and reject all of the null hypotheses. Otherwise, 
consider the set of remaining coordinates Cq = T-L\Rq and apply to it the 
step-down procedure associated with the threshold £ a o,quant.uncent(Y,C) 
defined by (13) (at level ao). 

Proposition 3.4. Fix 5 e (0,1) and ao G (0,a). In the two-sided con- 
text, Algorithm 3.3 gives rise to a multiple testing procedure with a FWER 
upper bounded by a. 

Proposition 3.4 is proved in Section 6. What we expect is that Algorithm 
3.3 essentially yields the same final result as the step-down procedure using 
the uncentered quantile (up to some negligible loss in the level by taking 
ao close to a), while requiring less iterations. In applications such as neu- 
roimaging, where a single iteration can take up to one day, this can result 
in a significant improvement. 

4. Simulation study. The (MATLAB) code used to perform the simula- 
tions of this section is available on the first author's webpage (currently at 
http : //www . di . ens . f r/~arlot/code/CRMTR . htm). 

4.1. Framework. We consider data of the form Yt = /it + Gt, where t be- 
longs to a d X d discretized two-dimensional torus of K = d 2 pixels, identified 
with = (Z/tfZ) 2 , and G is a centered Gaussian vector obtained by two- 
dimensional discrete convolution of an i.i.d. standard Gaussian field (white 
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Gaussian kernel convolution, K-128X128, n-1000, level 5% 




5 10 15 20 25 30 35 40 

Convolution kernel width (pixels) 



Fig. 1. Left: example of a 128 x 128 pixel image obtained by convolution of Gaussian 
white noise with a pseudo-Gaussian filter with width b = 18 pixels. Right: average thresholds 
obtained for the different approaches; see text. 

noise) on T 2 , with a function F:T^ -> R such that T^teT 2 = L This 

ensures that G is a stationary Gaussian process on the discrete torus; it is, 
in particular, isotropic with E[G 2 ] = 1 for all t £ T 2 . 

In the simulations below, we consider, for the function F, a pseudo- 
Gaussian convolution filter of bandwidth b on the torus: Fb(t) = Cb exp(— d x 
(0,i) 2 /6 2 ), where d(t,t') is the flat Riemannian distance on the torus and Cb 
is a normalizing constant. We then compare the different thresholds obtained 
by the methods proposed in this work for varying values of b. Remember 
that the algorithms considered here have no prior knowledge on the specific 
form of the function Ff, and would work in other, more complex, dependency 
contexts. 

We consider the two-sided case only. In all of the simulations to come, 
we fix the following parameters: the dimension is K = 128 2 = 16,384, the 
number of data points per sample is n = 1000 (hence significantly smaller 
than K) and the width b takes even values in the range [0, 40] (b = is white 
noise; see the left-hand side of Figure 1 for an example of noise realization 
when b = 18). The target test level is a = 0.05. We report the (empirical) 
expectation of each threshold over 250 draws of Y. 

For computation of the thresholds (9), (11) and (12), we have to choose 
some parameters 5 € (0, 1) and (for the two latter ones) ctQ < a. In each case, 
these parameters establish a trade-off between a main term and a remainder 
term; generally speaking, as n grows, one should choose 5 — > and ao a 
so that the level of the main resampled term tends to the target level a. In 
[1], it was suggested to take 5 of order 1/n and (1 — — ) of order n~ 7 for 
some 7 > to ensure that the remainder terms are indeed of lower order, 
but there is no exact recommendation for fixed n. In all of the simulations 
below, we decided to fix 5 = (1 — ^-) = 0.1, without particularly trying to 
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optimize this choice. When varying these parameter values, we noticed that 
the results were not overly sensitive to them. Finally, for all of the thresholds, 
the resampling quantities (quantiles or expectations) are estimated by Monte 
Carlo with 1000 draws (but we disregarded the additional terms proposed 
in [1] to account for the Monte Carlo random error). 

4.2. Simulations with unspecified alternative: Single-step, translation in- 
variant procedures. We first study the performance of the multiple test- 
ing procedures which have a translation invariant threshold (TI), that is, 
the single-step procedures using thresholds (7), (8), (9), (11) and (12), de- 
noted, respectively, by "bonf," "cone," "cone A bonf ," "quant + bonf" and 
"quant + cone." Their distributions do not depend on the true mean vector 
\x chosen to generate data and we have fixed fi = without loss of generality. 
Provided that the FWER constraint is satisfied, procedures with a smaller 
threshold are less conservative and hence more powerful. 

In Figure 1, we report the (averaged) values of each threshold. In this 
figure, we did not include standard deviations: they are quite low, of the 
order of 10 -3 , although it is worth noting that the quantile threshold has 
a standard deviation roughly twice as large as the concentration threshold. 
For comparison, we also included an estimation of the true quantile, that is, 
the 1 — a quantile of the distribution of sup fcg -^ | — /i& | (more precisely, an 
empirical quantile over 1000 samples), denoted by "ideal." The exact thresh- 
old corresponding to K = 1 (test of a single coordinate Gaussian mean) is 
also included for comparison and is denoted by "single." In the context of 
this experiment, we also computed the threshold (10), that is, the raw sym- 
metrized quantile obtained after empirical recentering of the data (for which 
no nonasymptotic theoretical results are available). This threshold was not 
reported in the plots because it turns out to be so close to the true quantile 
that they are almost indistinguishable. 

The overall conclusion of this first experiment is that the different thresh- 
olds proposed in this work are relevant: they improve on the Bonferroni 
threshold, provided the vector has strong enough correlations. As expected, 
the quantile approach appears to lead to tighter thresholds. (This might, 
however, not always be the case for smaller sample sizes because of the ad- 
ditional term.) One remaining advantage of the concentration approach is 
that the compound threshold (9) falls back on the Bonferroni threshold when 
needed, at the cost of a minimal threshold increase. Finally, the remainder 
terms introduced by the theory in the centered quantile thresholds appear 
overestimated since the raw resampled quantile is, in fact, extremely close 
to the true quantile. 
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4.3. Simulations with a specific alternative. We consider the experiment 
of the previous section, with the following choice for the vector of true 
means: 

(16) V(i,j)G{0,...,127} 2 H(i,j)= (64 ^" 4 J)+ >< 20t a ,Bonf(K)- 

In this situation, half of the null hypotheses are true, while the nonzero 
means are increasing linearly from (5/16)i a> B on f (H) to 20i Q!i Bonf The 
thresholds obtained are displayed in Figure 2, along with the averaged power 
of the corresponding procedures, defined as the expected proportion of signal 
correctly detected (i.e., averaged proportion of rejections among the false 
null hypotheses). 

In this experiment, we concentrated on the quantile-based thresholds. We 
picked the threshold (12) "quant + bonf" as a representative of the centered 
quantile approach and its step-down counterpart, denoted "s.d. quant + 
bonf." We compare these to the uncentered quantile threshold (13), denoted 
"quant. uncent," and its step-down version, "s.d. quant. uncent." Bonferroni's 
threshold and its step-down version "holm" are included for comparison. 
The threshold denoted "ideal" is now derived from the 1 — a quantile of the 
distribution of sup fe „ |Y^| and corresponds to the optimal threshold for 
FWER control. 

The results of the experiment can be summarized as follows: 

• The single-step centered quantile procedure "quant + bonf" outperforms 
Holm's procedure provided the coordinates of the vector are sufficiently 
correlated. Its step-down version "s.d. quant + bonf" performs even better, 
although the difference is not huge. 

• The single-step procedure based on the uncentered quantile "quant. uncent' 
has the worst performance, confirming the qualitative analysis following 
Theorem 2.3. 



Gaussian kernel convolution, K=128x128, n=1000, level 5% 
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- « - quant+bonf 

- a - s.d. quant+bonf 
ideal 
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Fig. 2. Multiple testing problem with /i defined by (16) for different approaches; see text. 
Left: average thresholds. Right: average power. 
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• The step-down procedure based on the uncentered quantile "s.d. quant. uncent 
seems, on the other hand, to be the most accurate among the procedures 
considered here, also in accordance with the qualitative analysis following 
Theorem 3.2. 

The latter point must be balanced with computation time considerations. 
When K and n are large, the step-down algorithm for the uncentered quan- 
tile takes longer to compute because of its iterative nature, while the single- 
step centered quantile procedure "quant + bonf" provides a relatively good 
accuracy without iterating. This brings us to the next point. 

4.4. Hybrid approach. We show here, with a specific simulation study, 
that the hybrid approach proposed in Algorithm 3.3 results in a speed/accuracy 
trade-off which is particularly noticeable when the mean values take on a 
large range. 

Consider the same simulation framework as above, except that the band- 
width b is now fixed at 30, the size of the sample is n = 100 and the means 
are given as follows: V(i, j) G {0, . . . , 127} 2 , fJ-Uj) = f{i + 128?) , where 

Vfc € {0, ... , 128 2 /2} f(k) = 0.5t a>Boa{ (H) 

(17) 

/128 2 /2 -fc r, , A 

and f(k) = for the other values of k. In this situation, the 128 2 /2 nonzero 
means are decreasing exponentially between 0.5t a Bonf 

{U)W/ W and 

0-5t a] Bonf (W), where r is the dynamic range (in dB) of the signal. 

In Figure 3, we have computed, for several values of r, the average number 
of iterations for the above step-down procedures, as well as their power when 
these procedures are stopped early after at most t iterations (such an early 
stopping is relevant in the case of a strict computation time constraint). We 
can sum up these results as follows: 

• The hybrid approach needs, on average, significantly less iterations to 
converge when r > 10. 

• Stopping the hybrid approach procedure after only two iterations results 
in an average power that is virtually indistinguishable from the power 
obtained without early stopping, uniformly over values of r. In contrast, 
as r increases, more iterations are needed for the step-down quantile un- 
centered threshold in order to reach full power. 

While these results are certainly specific to the particular simulation setup 
we used, they illustrate that the informal and qualitative analysis presented 
in Section 3 is correct when the signal (nonzero means) has a wide dynamic 
range. In particular, the fact that the hybrid approach already gives very 
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Signal range (dB) Signal range (dB) 



Fig. 3. Multiple testing problem with jj, corresponding to (17) for the step-down procedure 
based on the uncentered quantile (sdqu) and the hybrid step-down approach. Left: average 
number of step-down iterations. Right: average of the ratio of power to maximum power 
when the step-down is stopped after at most t iterations. Here, the maximum power is 
taken to be the power of "sdqu" without early stopping. (For the hybrid approach, the first 
step counts as one iteration.) 

satisfactory results after the first two iterations reinforces the interpretation 
that the first step (using the centered quantile threshold with remainder 
term) immediately rules out all coordinates with a large SNR, while the 
second step (using the exact, uncentered quantile) improves the precision 
once these high-SNR coordinates have been eliminated. 

5. Discussion and concluding remarks. 

5.1. Discussion: FWER versus FDR in multiple testing. It can legiti- 
mately be asked whether the FWER is an appropriate measure of type I 
error. The false discovery rate (FDR), introduced in [2] and defined as the 
average proportion of wrongly rejected hypotheses among all of the rejected 
hypotheses, appears to have recently become a de facto standard, in partic- 
ular, in the setting of a large number of hypotheses to test, as we consider 
here. One reason for the popularity of the FDR is that it is a less strict 
measure of error than the FWER and, to this extent, FDR-controlled pro- 
cedures reject more hypotheses than FWER-controlled ones. We give two 
reasons why the FWER is still a quantity of interest to investigate. First, the 
FDR is not always relevant, in particular, for neuroimaging data. Indeed, 
in this context, the signal is often strong over some well-known large areas 
of the brain (e.g., the motor and visual cortices). Therefore, if, for instance, 
95 percent of the detected locations belong to these well-known areas, FDR 
control (at level 5%) does not provide evidence for any new true discovery. 
On the contrary, FWER control is more conservative, but each detected 
location outside these well-known areas is a new true discovery with high 
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probability. Second, assuming that the FDR or a related quantity is never- 
theless the end goal, it can be very useful to consider a two-step procedure, 
where the first step consists of a FWER-controlled multiple test. Namely, 
this first step can be used as a mean to estimate the FDR or the FDP (false 
discovery proportion) of another procedure used in the second step and thus 
to fine-tune the parameters of this second step for the desired goal. This 
approach has been advocated, for example, in [3, 4] for finding FDR con- 
trolling procedures adaptive to the proportion of true nulls and in [12] to 
find specific regions in random fields, also with application to neuroimaging 
data. 

5.2. Conclusion. In this work, the main point was to introduce multiple 
testing procedures based on resampling thresholds (9), (11) and (12) coming 
from nonasymptotic confidence regions constructed in [1] . These confidence 
regions have theoretical control of the confidence level for any n, so the 
FWER of the corresponding multiple testing procedures is also controlled 
nonasymptotically. This issue is important in practice because the sample 
size is often much smaller than the number of tests to perform (K 3> n). 
Nevertheless, as the simulations of Section 4 suggest, remainder terms in the 
thresholds — precisely introduced to deal with this nonasymptotic setting — 
are overestimated by the theory and could probably be improved. 

Even in the presence of these corrective terms, we showed through exper- 
iments that these thresholds are able to capture the unknown dependency 
structure of the data and significantly outperform Holm's procedure when 
this dependency is strong enough. In comparison to exact randomization 
tests (based on an uncentered quantile), which also provide nonasymptotic 
level control, we argued that the empirical centering operation before ran- 
dom sign reversal results in translation invariant thresholds. These thresh- 
olds are, for this reason, unaffected by the unknown signal and thus already 
relevant for testing in the first iteration of the step-down algorithm. The 
method also applies to one-sided testing problems, where the uncentered 
approach is not theoretically justified as far as we know. Finally, the hybrid 
algorithm can approach the accuracy of the uncentered step-down threshold 
(which does not require corrective terms) while initially taking advantage of 
the centered threshold, resulting in a faster computation. 

For practical purposes, it is certainly tempting to recommend using a 
(step-down) procedure based on the raw, unmodified centered quantile with- 
out remainder terms (10): this would correspond to the principle of tra- 
ditional resampling. To this extent, and to rephrase the discussion in [1], 
nonasymptotic theoretical results can also be understood from an asymp- 
totic point of view, justifying the use of resampling (in a specific setting — 
Gaussian variables, test for the mean, Rademacher weights) for a regime 
that is not usually covered by traditional asymptotics (i.e., dimension K n 
increasing with n). 
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6. Proof of Proposition 3.4. First, note that q ao (Y,T-Co) < q ao (Y — fi,'H) . 
From the proof of Theorem 3.2 in [1], with probability greater than 1 — (a — 
ao), we have 

Qa (Y — ^,T~L) < ^a.quant+Bonf (Y, %). 

Take Y in the event where the above inequality holds. If the global procedure 
rejects at least one true null hypothesis, let jo denote the first time that this 
occurs (jo = if it is in the first step). There are two cases: 

• if jo = 0, thensupfcg^jYfel > i a , qU ant+Bonf (Y, H) > q ao (Y-fi,'H) > <?a (Y, 

n y, _ 

• if jo > 1, then sup fc6W() |Y fc | > t ao 

, quant, unccnt ( Y ,Cj _i) and Ho C Cj _i 
(from the definition of j ) so that sup fc6%) |Y fc | > t ao 

, quant. unccnt (Y,^ ) = 

q ao (Y,H )- 

In both cases, sup fcg -^ |Yfc| > q^iY^Ho), which occurs with probability 
smaller than an- 
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